# A1: Data Curation
Karl Stavem  
ID:  1978397

---
> TL;DR: The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1, 2008 through August 30, 2020.
---

### Step 1: Gathering the data

In [29]:
# import libraries
import os
import requests
import json
import pandas as pd

In [None]:
# Collect 5 JSON files covering January 2008 - August 2020:
#   2 from the legacy pagecounts API (desktop, mobile)
#   3 from the newer pageviews API (desktop, mobile web, mobile app)

First, make sure header information is set for each API call.

In [79]:
# set headers for API calls
headers = {
    'User-Agent': 'https://github.com/stavem',
    'From': 'kstavem@uw.edu'
}

Since there are two separate APIs, define the endpoints.

In [80]:
# set the API endpoint
endpoint_legacy = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'
endpoint_new = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

The two APIs take similar parameters, but the syntax differs slightly, so define a function that builds the parameters for either one. Most parameters stay constant in this exercise, so only the API version and the access site need to change between calls.

In [88]:
def assign_parameters(is_legacy_version, access, project="en.wikipedia.org", agent="user",
                      granularity="monthly", start='2008010100', end='2020090100'):
    """
    Build the parameters for an API query.
    Input:  is_legacy_version - True for the legacy pagecounts API, False for the pageviews API
            access - access site to query (e.g. 'desktop-site' or 'mobile-web')
    Output:  Parameters for API query
    """
    if is_legacy_version:
        # the legacy API calls the access field 'access-site' and has no agent field
        params = {"project": project,
                  "access-site": access,
                  "granularity": granularity,
                  "start": start,
                  "end": end}
    else:
        params = {"project": project,
                  "access": access,
                  "agent": agent,
                  "granularity": granularity,
                  "start": start,
                  "end": end}
    return params
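
As a quick illustration of how these parameters fill in an endpoint template, the sketch below formats both URLs without making any network call. It restates the two endpoint strings defined earlier and uses a simplified, hypothetical `build_url` helper that is not part of the notebook. It relies on `str.format` accepting the hyphenated `access-site` key when the parameters are unpacked from a dictionary.

```python
# Endpoint templates copied from the endpoint cell earlier in the notebook
ENDPOINT_LEGACY = ('https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/'
                   '{project}/{access-site}/{granularity}/{start}/{end}')
ENDPOINT_NEW = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/'
                '{project}/{access}/{agent}/{granularity}/{start}/{end}')


def build_url(endpoint, params):
    """Hypothetical helper: fill an endpoint template from a parameter dict."""
    return endpoint.format(**params)


# Parameter dicts in the same shape that assign_parameters returns
legacy_params = {"project": "en.wikipedia.org", "access-site": "desktop-site",
                 "granularity": "monthly", "start": "2008010100", "end": "2020090100"}
new_params = {"project": "en.wikipedia.org", "access": "mobile-web", "agent": "user",
              "granularity": "monthly", "start": "2008010100", "end": "2020090100"}

legacy_url = build_url(ENDPOINT_LEGACY, legacy_params)
new_url = build_url(ENDPOINT_NEW, new_params)
print(legacy_url)
print(new_url)
```

Printing the URL before sending a request makes it easy to spot a misspelled access value.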

In [117]:
# define the 5 sites we'll be looking at
new_sites = ['desktop', 'mobile-web', 'mobile-app']
legacy_sites = ['desktop-site', 'mobile-site']

In [22]:
def api_call(endpoint,parameters):
    """
    Method to conduct API calls
    Inputs:  Wikipedia API Endpoint, Parameter for API call
    Output:  Response in JSON format
    """
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()
    
    return response
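
`api_call` returns whatever JSON the server sends back, so a misspelled access value can yield an error payload that is then saved as if it were data. A minimal sketch of a guard follows, assuming that a successful response carries its records under an `items` key, as the DataFrame cell later in this step suggests. The `extract_items` helper is hypothetical and both sample payloads are made up for illustration.

```python
def extract_items(response):
    """Hypothetical helper: return the records from a response, or fail loudly."""
    if not isinstance(response, dict) or 'items' not in response:
        raise ValueError('unexpected API response: {!r}'.format(response))
    return response['items']


# Made-up payloads for illustration only
good = {'items': [{'timestamp': '2008010100', 'count': 1}]}
bad = {'detail': 'illustrative error message'}

records = extract_items(good)

try:
    extract_items(bad)
    rejected = False
except ValueError:
    rejected = True

print(len(records), rejected)
```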

In [119]:
for site in new_sites:
    # call the pageviews API
    params = assign_parameters(is_legacy_version = False, access = site)
    data = api_call(endpoint_new, params)

    # write to file (named after the pageviews API that produced the data)
    with open('data-512/data-512-a1/pageviews_{}_200712-202008.json'.format(site), 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

In [121]:
for site in legacy_sites:
    # call the API
    params = assign_parameters(is_legacy_version = True, access = site)
    data = api_call(endpoint_legacy, params)
    
    # write to file
    with open('data-512/data-512-a1/legacy_{}_200712-202008.json'.format(site), 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
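
One way to confirm that a saved file is usable is to read it back and compare it with what was written. The sketch below does this with a made-up payload and a temporary directory, so it touches neither the API nor the `data-512` folder. The file name is a placeholder.

```python
import json
import os
import tempfile

# Made-up payload in the same general shape as an API response
sample = {'items': [{'project': 'en.wikipedia', 'timestamp': '2008010100', 'count': 123}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.json')

    # write with the same json.dump options used in the notebook
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f, ensure_ascii=False, indent=4)

    # read the file back before the temporary directory is removed
    with open(path, encoding='utf-8') as f:
        loaded = json.load(f)

round_trip_ok = (loaded == sample)
print(round_trip_ok)
```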

In [110]:
assign_parameters(is_legacy_version = True, access = "desktop-site")

{'project': 'en.wikipedia.org',
 'access-site': 'desktop-site',
 'granularity': 'monthly',
 'start': '2008010100',
 'end': '2020090100'}


In [116]:
for site in new_sites:
    print(site)

desktop
mobile-web
mobile-app


In [113]:
data_legacy_desktop = api_call(endpoint_legacy, assign_parameters(is_legacy_version = True, access = 'desktop-site'))
data_legacy_mobile_site = api_call(endpoint_legacy, assign_parameters(is_legacy_version = True, access = 'mobile-site'))


In [99]:
# save the legacy desktop response fetched above
with open('data-512/data-512-a1/pagecounts_desktop-site_200712-202008.json', 'w', encoding='utf-8') as f:
    json.dump(data_legacy_desktop, f, ensure_ascii=False, indent=4)

In [62]:
# build a DataFrame from the list of records under 'items'
df = pd.DataFrame(data_legacy_desktop['items'])
df

| | project | access-site | granularity | timestamp | count |
|---|---|---|---|---|---|
| 0 | en.wikipedia | desktop-site | monthly | 2008010100 | 4930902570 |
| 1 | en.wikipedia | desktop-site | monthly | 2008020100 | 4818393763 |
| 2 | en.wikipedia | desktop-site | monthly | 2008030100 | 4955405809 |
| 3 | en.wikipedia | desktop-site | monthly | 2008040100 | 5159162183 |
| 4 | en.wikipedia | desktop-site | monthly | 2008050100 | 5584691092 |
| ... | ... | ... | ... | ... | ... |
| 99 | en.wikipedia | desktop-site | monthly | 2016040100 | 5572235399 |
| 100 | en.wikipedia | desktop-site | monthly | 2016050100 | 5330532334 |
| 101 | en.wikipedia | desktop-site | monthly | 2016060100 | 4975092447 |
| 102 | en.wikipedia | desktop-site | monthly | 2016070100 | 5363966439 |


In [66]:
print(max(df['timestamp']))
print(min(df['timestamp']))

2016080100
2008010100



### Step 2: Processing the data

In [None]:
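# build the DataFrame from the records under 'items', keeping an explicit column order
df = pd.DataFrame(data_legacy_desktop['items'], columns=['project', 'access-site', 'granularity', 'timestamp', 'count'])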
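
The `timestamp` values have the form `YYYYMMDDHH`, so a likely early processing step is splitting them into a year and a month. The sketch below does that with the standard library only. The `split_timestamp` helper and the output field names `year` and `month` are assumptions, and the two sample records reuse values from the output shown in Step 1.

```python
from datetime import datetime


def split_timestamp(timestamp):
    """Hypothetical helper: turn a 'YYYYMMDDHH' string into (year, month)."""
    parsed = datetime.strptime(timestamp, '%Y%m%d%H')
    return parsed.year, parsed.month


# Two records taken from the desktop-site output shown in Step 1
records = [
    {'timestamp': '2008010100', 'count': 4930902570},
    {'timestamp': '2008020100', 'count': 4818393763},
]

rows = []
for record in records:
    year, month = split_timestamp(record['timestamp'])
    rows.append({'year': year, 'month': month, 'count': record['count']})

print(rows)
```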

### Step 3: Analyze the data

### Step 4: Document your process thoroughly

---
### Step 5: Submit the assignment

The GitHub repository for this assignment can be found <a href="https://github.com/stavem/data-512/tree/main/data-512-a1">here</a>.