# Data Acquisition using Wikipedia's APIs

## Overview

### This notebook contains the step-by-step code and related documentation for acquiring monthly desktop and mobile traffic data for the English Wikipedia

To measure Wikipedia traffic from 2008 through 2017, we need to collect data from two different API endpoints: the legacy Pagecounts API and the Pageviews API.

1. The legacy Pagecounts API [documentation](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts), [endpoint](https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_legacy/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end) provides access to desktop and mobile traffic data from January 2008 through July 2016. For this work, I have chosen to pull the data starting from July 2008. If you are running the code in this document yourself and would rather start from January 2008, change the first element of the PageCounts_DateRange variable from '2008070100' to '2008010100', so that its value becomes ['2008010100','2016080100'].

2. The Pageviews API [documentation](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews), [endpoint](https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through September 2017.

First, we import the modules needed to run the code in this notebook.

In [1]:
#The requests module is used to retrieve the data from the REST API endpoints
import requests
#Periodically we need to check intermediate results, which we do by displaying the values of variables.
#This uses the display function from the IPython.display module (IPython.core.display is deprecated for this import).
from IPython.display import display

We then define some variables and assign values to them. This is for convenience: to run the notebook under a different set of conditions, only these values need to be updated, and the rest of the code does not need to be modified.

Note that the Wikipedia APIs require timestamps in the form YYYYMMDD00, where YYYY is the 4-digit year, MM is the 2-digit month, and DD is the 2-digit day, followed by two trailing zeros (the hour position, left at 00 here). We use this format throughout the acquisition phase.
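As a minimal sketch of the timestamp format described above, the helpers below build YYYYMMDD00 strings from a year and month. The names `to_api_timestamp` and `first_of_next_month` are hypothetical and not part of this notebook; the notebook itself simply hardcodes the strings.

```python
from datetime import date

def to_api_timestamp(year, month, day=1):
    """Format a date as YYYYMMDD00, the form used throughout this notebook."""
    return date(year, month, day).strftime('%Y%m%d') + '00'

def first_of_next_month(year, month):
    """Timestamp for the first day of the month after (year, month).

    This is the value to use as the end of a range so that the whole of
    (year, month) is included.
    """
    if month == 12:
        return to_api_timestamp(year + 1, 1)
    return to_api_timestamp(year, month + 1)

# Reproduces the pageviews range used in this notebook (July 2015 to September 2017).
start = to_api_timestamp(2015, 7)
end = first_of_next_month(2017, 9)
print(start, end)
```

Computing the end timestamp this way avoids off-by-one mistakes around December, where the following month falls in the next year.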

In [2]:
#Pagecounts data is available from January 2008 but is being pulled only from July 2008 for this work. 
#If you are following along and wish to instead pull data from January 2008, then change the PageCounts_DateRange as below:
#PageCounts_DateRange = ['2008010100','2016080100']
#no other changes should be necessary.

#Start and End Dates for which the PageCounts data is pulled. 
PageCounts_DateRange = ['2008070100','2016080100']

#Start and End Dates for which the PageViews data is pulled.
PageViews_DateRange = ['2015070100','2017100100']

#Directory where the raw data (that are pulled using the APIs) will be saved to at the end of the acquisition phase.
raw_data_dir = './data/raw/'

#REST API endpoint for pageviews
pageviews_endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

#REST API endpoint for pagecounts
pagecounts_endpoint = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access}/{granularity}/{start}/{end}'

#header values that are required to be passed to the API. 
#NOTE: You are strongly advised to modify these values to point to your github url and account if you plan on running this code
headers={'User-Agent' : 'https://github.com/sumanbhagavathula', 'From' : 'sumanbh@uw.edu'}

We now begin querying the REST API endpoints to retrieve the data we need. Every API invocation follows much the same pattern:

1. Define and initialize a list that will, if the step succeeds, contain the raw data pulled from the API.
2. Initialize the params object with the required attributes and values. Most of the attributes are common to the two API endpoints; the few differences are highlighted in the code comments.
3. Make the API call with requests.get. The endpoint has already been defined in the variable declarations at the beginning of this notebook.
4. Parse the API response as JSON and save its items to the list defined in the first step.
5. Serialize the data in this list to a file, so that downstream steps can read it from the raw data folder in this repository instead of rerunning the steps in this notebook.
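The five steps above can be sketched as a single function. This is only an illustration under stated assumptions: the name `acquire`, the `out_path` argument, and the `get` hook are hypothetical and do not appear in this notebook, which instead repeats the steps in each cell. The `get` hook exists so the function can be exercised without network access; when it is omitted, the sketch falls back to requests.get.

```python
import json
import os

def acquire(endpoint, params, headers, out_path, get=None):
    """Run the five acquisition steps for one endpoint/params combination."""
    if get is None:
        import requests  # imported lazily so the sketch loads without it
        get = requests.get

    raw_data = []                                    # step 1: placeholder list
    url = endpoint.format(**params)                  # step 2: fill in the params
    response = get(url, headers=headers)             # step 3: the API call
    raw_data = response.json()['items']              # step 4: parse the JSON items

    # step 5: serialize to a file for downstream use
    os.makedirs(os.path.dirname(out_path) or '.', exist_ok=True)
    with open(out_path, 'w') as f:
        json.dump(raw_data, f)
    return raw_data
```

With the variables defined earlier, a call might look like `acquire(pageviews_endpoint, params, headers, raw_data_dir + 'pageviews_desktop_201507-201709.json')`.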

We start by retrieving the mobile web page views. The cell follows the five-step pattern described above. Note that the name of the file the data is written to is hardcoded; you may change it, but any downstream processing or analysis steps may then need to be modified accordingly.
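As an alternative to hardcoding, the file name could be derived from the request itself. The sketch below assumes the naming pattern visible in this notebook (API name, access type, first month, last month); the helper `raw_file_name` is hypothetical. Because the end timestamp is the first day of the month after the last month requested, the label steps back one month.

```python
def raw_file_name(api, access, start, end):
    """Build a name such as 'pageviews_mobile-web_201507-201709.json'.

    `start` and `end` are YYYYMMDD00 strings as used in this notebook.
    """
    first = start[:6]
    year, month = int(end[:4]), int(end[4:6])
    # Step back one month, wrapping from January to the previous December.
    if month == 1:
        year, month = year - 1, 12
    else:
        month -= 1
    last = '{:04d}{:02d}'.format(year, month)
    return '{}_{}_{}-{}.json'.format(api, access, first, last)

print(raw_file_name('pageviews', 'mobile-web', '2015070100', '2017100100'))
```

Note that this labels the requested range, which may be wider than the range of data the API actually returns.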

In [3]:
import json

#Initialize a placeholder list that will contain the data retrieved from the pageviews API, if successful.
PageViews_MobileWeb_RawData=[]

#Initialize the required parameters. NOTE: We are pulling only views for users and excluding the spiders or crawlers.
params = {'project' : 'en.wikipedia.org',
            'access' : 'mobile-web',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : PageViews_DateRange[0],
            'end' : PageViews_DateRange[1] #use the first day of the following month to ensure a full month of data is collected
            }

#Make the API call, passing the identifying headers defined earlier, and capture the response object
api_call = requests.get(pageviews_endpoint.format(**params), headers=headers)
#Parse the JSON response and store its items in the variable declared at the beginning of this code snippet.
PageViews_MobileWeb_RawData = api_call.json()['items']
#Optionally, display the raw page views data for mobile web
display(PageViews_MobileWeb_RawData)
#Save the raw data as valid JSON in the raw data directory (the directory must already exist)
with open(raw_data_dir + 'pageviews_mobile-web_201507-201709.json','w') as f:
    json.dump(PageViews_MobileWeb_RawData, f)

[{'access': 'mobile-web',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015070100',
  'views': 3179131148},
 {'access': 'mobile-web',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015080100',
  'views': 3192663889},
 {'access': 'mobile-web',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015090100',
  'views': 3073981649},
 {'access': 'mobile-web',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015100100',
  'views': 3173975355},
 {'access': 'mobile-web',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015110100',
  'views': 3142247145},
 {'access': 'mobile-web',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015120100',
  'views': 3276836351},
 {'access': 'mobile-web',
  'agent': 'user',
  'granularity': 'monthly

In the next step, we retrieve the mobile app page views. Note that the access attribute of the params object has been changed accordingly.

In [4]:
import json

#Initialize a placeholder list that will contain the data retrieved from the pageviews API, if successful.
PageViews_MobileApp_RawData=[]

#Initialize the required parameters. NOTE: We are pulling only views for users and excluding the spiders or crawlers.
params = {'project' : 'en.wikipedia.org',
            'access' : 'mobile-app',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : PageViews_DateRange[0],
            'end' : PageViews_DateRange[1] #use the first day of the following month to ensure a full month of data is collected
            }

#Make the API call, passing the identifying headers defined earlier, and capture the response object
api_call = requests.get(pageviews_endpoint.format(**params), headers=headers)
#Parse the JSON response and store its items in the variable declared at the beginning of this code snippet.
PageViews_MobileApp_RawData = api_call.json()['items']
#Optionally, display the raw page views data for mobile app
display(PageViews_MobileApp_RawData)
#Save the raw data as valid JSON in the raw data directory (the directory must already exist)
with open(raw_data_dir + 'pageviews_mobile-app_201507-201709.json','w') as f:
    json.dump(PageViews_MobileApp_RawData, f)

[{'access': 'mobile-app',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015070100',
  'views': 109624146},
 {'access': 'mobile-app',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015080100',
  'views': 109669149},
 {'access': 'mobile-app',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015090100',
  'views': 96221684},
 {'access': 'mobile-app',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015100100',
  'views': 94523777},
 {'access': 'mobile-app',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015110100',
  'views': 94353925},
 {'access': 'mobile-app',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015120100',
  'views': 99438956},
 {'access': 'mobile-app',
  'agent': 'user',
  'granularity': 'monthly',
  'proj

Next, we retrieve the raw page views data for desktop users. Everything remains the same except the access attribute of the params object, which is now 'desktop'.

In [5]:
import json

#Initialize a placeholder list that will contain the data retrieved from the pageviews API, if successful.
PageViews_Desktop_RawData=[]

#Initialize the required parameters. NOTE: We are pulling only views for users and excluding the spiders or crawlers.
params = {'project' : 'en.wikipedia.org',
            'access' : 'desktop',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : PageViews_DateRange[0],
            'end' : PageViews_DateRange[1] #use the first day of the following month to ensure a full month of data is collected
            }

#Make the API call, passing the identifying headers defined earlier, and capture the response object
api_call = requests.get(pageviews_endpoint.format(**params), headers=headers)
#Parse the JSON response and store its items in the variable declared at the beginning of this code snippet.
PageViews_Desktop_RawData = api_call.json()['items']
#Optionally, display the raw page views data for desktop
display(PageViews_Desktop_RawData)
#Save the raw data as valid JSON in the raw data directory (the directory must already exist)
with open(raw_data_dir + 'pageviews_desktop_201507-201709.json','w') as f:
    json.dump(PageViews_Desktop_RawData, f)

[{'access': 'desktop',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015070100',
  'views': 4376666686},
 {'access': 'desktop',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015080100',
  'views': 4332482183},
 {'access': 'desktop',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015090100',
  'views': 4485491704},
 {'access': 'desktop',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015100100',
  'views': 4477532755},
 {'access': 'desktop',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015110100',
  'views': 4287720220},
 {'access': 'desktop',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015120100',
  'views': 4100012037},
 {'access': 'desktop',
  'agent': 'user',
  'granularity': 'monthly',
  'project': 'en.w

We now switch to the PageCounts data. As noted earlier, this requires invoking a different Wikipedia API endpoint. This legacy API offers no way to separate organic user traffic from crawlers or spiders, so the counts retrieved include both types of traffic.

The API invocation mechanism and the data retrieval process follow the same five steps described earlier, with a few deviations such as the missing agent attribute mentioned in the previous paragraph.
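To make the difference between the two endpoints concrete, the sketch below fills in both URL templates from this notebook. The parameter values are the desktop ones used here; the variable names `common`, `pageviews_params`, and `pagecounts_params` are introduced only for illustration.

```python
pageviews_endpoint = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/'
                      '{project}/{access}/{agent}/{granularity}/{start}/{end}')
pagecounts_endpoint = ('https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/'
                       '{project}/{access}/{granularity}/{start}/{end}')

# Attributes shared by both endpoints.
common = {'project': 'en.wikipedia.org', 'granularity': 'monthly'}

# Pageviews takes an agent attribute and uses access values such as 'desktop'.
pageviews_params = dict(common, access='desktop', agent='user',
                        start='2015070100', end='2017100100')
# Pagecounts has no agent attribute and uses access values such as 'desktop-site'.
pagecounts_params = dict(common, access='desktop-site',
                         start='2008070100', end='2016080100')

pageviews_url = pageviews_endpoint.format(**pageviews_params)
pagecounts_url = pagecounts_endpoint.format(**pagecounts_params)
print(pageviews_url)
print(pagecounts_url)
```

Formatting the pageviews template with the pagecounts parameters raises a KeyError, because the agent field has no value to fill in.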

First in this process, we get the desktop user visit counts as follows:

In [6]:
import json

#Initialize a placeholder list that will contain the data retrieved from the pagecounts API, if successful.
PageCounts_Desktop_RawData=[]
#Initialize the required parameters. NOTE: The legacy API has no agent attribute, so the counts include spiders and crawlers.
params = {'project' : 'en.wikipedia.org',
            'access' : 'desktop-site',
            'granularity' : 'monthly',
            'start' : PageCounts_DateRange[0],
            'end' : PageCounts_DateRange[1] #use the first day of the following month to ensure a full month of data is collected
            }
#Make the API call, passing the identifying headers defined earlier, and capture the response object
api_call = requests.get(pagecounts_endpoint.format(**params), headers=headers)
#Parse the JSON response and store its items in the variable declared at the beginning of this code snippet.
PageCounts_Desktop_RawData = api_call.json()['items']
#Optionally, display the raw page counts data for desktop
display(PageCounts_Desktop_RawData)
#Save the raw data as valid JSON in the raw data directory (the directory must already exist)
with open(raw_data_dir + 'pagecounts_desktop-site_200807-201607.json','w') as f:
    json.dump(PageCounts_Desktop_RawData, f)

[{'access-site': 'desktop-site',
  'count': 5306302874,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2008070100'},
 {'access-site': 'desktop-site',
  'count': 5140155519,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2008080100'},
 {'access-site': 'desktop-site',
  'count': 5479533823,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2008090100'},
 {'access-site': 'desktop-site',
  'count': 5679440782,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2008100100'},
 {'access-site': 'desktop-site',
  'count': 5415832071,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2008110100'},
 {'access-site': 'desktop-site',
  'count': 5211708451,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2008120100'},
 {'access-site': 'desktop-site',
  'count': 5802681551,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2009010100'},

In the last step, we retrieve the page counts for mobile users. Note that mobile usage was not split into web and app at that time. Also, although the request starts from July 2008, the API returns mobile data only from October 2014 onward, as the first timestamp in the output shows.
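Since the API may return a narrower range than the one requested, the months actually covered can be read from the response itself. The sketch below is illustrative only: `covered_months` is a hypothetical helper, and `sample_items` holds just the first two records shown in this notebook's output, trimmed to the relevant fields.

```python
def covered_months(items):
    """Return the (first, last) YYYYMM labels present in a list of API items."""
    stamps = sorted(item['timestamp'] for item in items)
    return stamps[0][:6], stamps[-1][:6]

# First two mobile-site records from the output of this notebook.
sample_items = [
    {'access-site': 'mobile-site', 'count': 3091546685, 'timestamp': '2014100100'},
    {'access-site': 'mobile-site', 'count': 3027489668, 'timestamp': '2014110100'},
]

first, last = covered_months(sample_items)
print(first, last)
```

Applied to the full response, this would give the labels used in a file name such as the one in the cell below.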

In [7]:
import json

#Initialize a placeholder list that will contain the data retrieved from the pagecounts API, if successful.
PageCounts_Mobile_RawData=[]
#Initialize the required parameters. NOTE: The legacy API has no agent attribute, so the counts include spiders and crawlers.
params = {'project' : 'en.wikipedia.org',
            'access' : 'mobile-site',
            'granularity' : 'monthly',
            'start' : PageCounts_DateRange[0],
            'end' : PageCounts_DateRange[1] #use the first day of the following month to ensure a full month of data is collected
            }
#Make the API call, passing the identifying headers defined earlier, and capture the response object
api_call = requests.get(pagecounts_endpoint.format(**params), headers=headers)
#Parse the JSON response and store its items in the variable declared at the beginning of this code snippet.
PageCounts_Mobile_RawData = api_call.json()['items']
#Optionally, display the raw page counts data for mobile
display(PageCounts_Mobile_RawData)
#Save the raw data as valid JSON in the raw data directory (the directory must already exist)
with open(raw_data_dir + 'pagecounts_mobile-site_201410-201607.json','w') as f:
    json.dump(PageCounts_Mobile_RawData, f)

[{'access-site': 'mobile-site',
  'count': 3091546685,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2014100100'},
 {'access-site': 'mobile-site',
  'count': 3027489668,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2014110100'},
 {'access-site': 'mobile-site',
  'count': 3278950021,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2014120100'},
 {'access-site': 'mobile-site',
  'count': 3485302091,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015010100'},
 {'access-site': 'mobile-site',
  'count': 3091534479,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015020100'},
 {'access-site': 'mobile-site',
  'count': 3330832588,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015030100'},
 {'access-site': 'mobile-site',
  'count': 3222089917,
  'granularity': 'monthly',
  'project': 'en.wikipedia',
  'timestamp': '2015040100'},
 {'acc

Now that we have retrieved all the raw data required for our study from the Wikipedia API endpoints, there are a few points to note:

1. The code in this notebook is only known to work as of its publication date. It depends on external components, in particular the availability and signatures of the Wikipedia REST API endpoints. If the behavior of an endpoint, or the endpoint itself, changes, this code may break and require manual intervention. Since this code is not being actively updated, please reach out to the owner of the code about any break-fix issues.

2. The code steps in this notebook are not designed to handle errors. If the code stops working for some unforeseen reason, the raw data can be obtained by downloading the data files from this repository.

3. The API calls in this notebook are independent of one another, and there is no specific order in which the page views or page counts need to be retrieved. The order chosen in this notebook is arbitrary.
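Regarding point 2, a reader who wants basic error handling could wrap the response parsing as sketched below. This is not part of the notebook: `items_or_raise` is a hypothetical helper, and the only library behavior it relies on is that a requests response exposes `raise_for_status()` and `json()`.

```python
def items_or_raise(response):
    """Return response.json()['items'], failing with a clear error otherwise."""
    # requests raises an HTTPError here for 4xx and 5xx status codes.
    response.raise_for_status()
    payload = response.json()
    if 'items' not in payload:
        raise ValueError('unexpected API response: {}'.format(payload))
    return payload['items']
```

In any of the cells above, `api_call.json()['items']` could then be replaced with `items_or_raise(api_call)`, so that a failed request stops the cell before an incomplete file is written.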

If you are interested in following along, please go through the next document in this process, which contains the data processing steps. 