# Homework 1
The goal of this assignment is to construct, analyze, and publish a dataset of monthly article traffic for a select set of pages from English Wikipedia from July 1, 2015 through September 30, 2023. Your notebook(s) and your data files will be uploaded to a repository of your choosing. You will submit a link to your repository to enable grading of this assignment. The purpose of the assignment is to develop and follow best practices for open scientific research.

# Step 1: Data Acquisition
In order to measure article traffic from 2015-2023, you will need to collect data from the Pageviews API. The Pageviews API (documentation, endpoint) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through the previous complete month.

To get you started, you can refer to this example notebook, which contains sample code for making the Wikipedia Pageviews API call. The sample code is licensed CC-BY; feel free to reuse any of the code in the example notebook with appropriate attribution.

You will be collecting counts of pageviews using a specified subset of Wikipedia article pages. This is a subset of the English Wikipedia that represents a large number of articles about Academy Award-winning movies.
You will use the same article subset to create several related datasets. All of the datasets are time series of monthly activity. For all of the datasets we are only interested in actual user pageview requests. The three resulting datasets should be saved as JSON files ordered using article titles as a key for the resulting time series data. You should store the time series data as returned from the API, with the exception of removing the ‘access’ field, as it is misleading for mobile and cumulative files.

## 1.i. Importing the required libraries
- We start by importing all the necessary libraries

In [1]:
!pip install tqdm
!pip install pandas

In [2]:
import json, time, urllib.parse
import requests
import pandas as pd
import numpy as np
import re
import os
from tqdm import tqdm

## 1.ii. Configuring the API parameters
### License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - August 14, 2023

- We read the cleaned .csv file locally

In [3]:
df = pd.read_csv(os.path.join("..","data","thank_the_academy.AUG.2023.csv"))

In [4]:
# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include a "unique ID" that will allow them to
# contact you if something happens - such as - your code exceeding request limits - or some other error happens

REQUEST_HEADERS = {
    'User-Agent': '<sagnik99@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# The name of the academy movie from the input file is saved as a list of Article Titles
ARTICLE_TITLES = df['name']

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example, below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this will be changed for different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015070100",   # For this example the start date is July 2015
    "end":         "2023093000"    # For this example the end date is Sept 2023
}
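
- As a rough illustration of how the format string and the template dictionary combine, the sketch below builds one request URL and works out the throttle delay. It re-declares the values under local names (`endpoint`, `params`, `template`, `throttle_wait`, all chosen for this example) so it can run on its own, and the article title is only an example value that needs no URL encoding.

```python
# Stand-alone sketch: these names mirror the constants above but are re-declared locally
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
params = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

template = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",
    "agent":       "user",
    "article":     "Everything_Everywhere_All_at_Once",   # example title, no special characters
    "granularity": "monthly",
    "start":       "2015070100",
    "end":         "2023093000",
}

# str.format(**dict) replaces each {name} placeholder with the matching dictionary value
request_url = endpoint + params.format(**template)
print(request_url)

# 100 requests per second allows 10 ms per request; subtracting the assumed
# 2 ms of latency leaves roughly 8 ms of sleep before each request
throttle_wait = (1.0 / 100.0) - 0.002
print(f"{throttle_wait:.3f} seconds of sleep per request")
```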

- The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages.

In [5]:
def request_pageviews_per_article(article_title = None, 
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):
    
    # work on a copy so that a call never modifies the shared template dictionary
    request_template = dict(request_template)

    # article title can be a parameter to the call or in the request_template
    if article_title:
        request_template['article'] = article_title

    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")
    
    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    # safe='' also encodes "/" so a title containing a slash stays in one path segment
    article_title_encoded = urllib.parse.quote(request_template['article'].replace(' ','_'), safe='')
    request_template['article'] = article_title_encoded
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
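
- The title encoding step can be checked in isolation. The sketch below is an illustration only: `encode_title` is a hypothetical helper, and the titles are example inputs. By default `urllib.parse.quote` leaves `/` unescaped, so a title that contains a slash would be split across two URL path segments; passing `safe=''` encodes the slash as well.

```python
import urllib.parse

def encode_title(title, safe='/'):
    """Hypothetical helper: replace spaces with underscores, then percent-encode."""
    return urllib.parse.quote(title.replace(' ', '_'), safe=safe)

# Parentheses are percent-encoded, underscores are left alone
print(encode_title('Zorba the Greek (film)'))

# With the default safe='/' the slash passes through unchanged
print(encode_title('Victor/Victoria'))

# With safe='' the slash is encoded, so the title stays in one path segment
print(encode_title('Victor/Victoria', safe=''))
```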

## 1.iii. Extracting data from Wikipedia
- The code below loops over the access types and fetches data for every article title. A try/except block keeps the loop running when a request fails or the server responds slowly.

In [6]:
access_type = ['mobile-app', 'mobile-web', 'desktop', 'all-access']
views_list = []
for title in tqdm(ARTICLE_TITLES):
    try:
        for access in access_type:
            ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['access'] = access
            views = request_pageviews_per_article(title)
            views_list.append(pd.json_normalize(views['items']))
    except Exception:
        # a failed request returns None or a payload without 'items'; report the title and move on
        print("Could not get data for: ", title)

100%|██████████████████████████████████████████████████████████████████████████████| 1359/1359 [15:20<00:00,  1.48it/s]
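
- One way to make the failure handling more explicit is to check each response before using it. The sketch below is an illustration under stated assumptions: `items_or_empty` is a hypothetical helper, the payloads are hand-written stand-ins rather than real API responses, and an error payload is assumed to carry no `items` key.

```python
def items_or_empty(json_response):
    """Hypothetical helper: return the 'items' list, or [] when the request failed."""
    if not json_response or 'items' not in json_response:
        return []
    return json_response['items']

# Hand-written stand-ins for API responses, not real data
ok_response = {'items': [
    {'project': 'en.wikipedia', 'article': 'Example_Article', 'granularity': 'monthly',
     'timestamp': '2023080100', 'access': 'desktop', 'agent': 'user', 'views': 10},
]}
error_response = {'title': 'Not found.'}   # assumed shape of an error payload
failed_request = None                      # a request that raised and returned None

collected = []
for response in (ok_response, error_response, failed_request):
    # Only the successful response contributes rows
    collected.extend(items_or_empty(response))

print(len(collected), "row(s) collected")
```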


In [7]:
# Converting the list into a dataframe for easier data manipulation
df_views = pd.concat(views_list)

## 1.iv. Creating JSON files for each access type

1. **Monthly mobile access** - The API separates mobile access types into two separate requests, you will need to sum these to make one count for all mobile pageviews. You should store the mobile access data in a file called:
academy_monthly_mobile_&lt;startYYYYMM&gt;-&lt;endYYYYMM&gt;.json

2. **Monthly desktop access** - Monthly desktop page traffic is based on one single request. You should store the desktop access data in a file called:
academy_monthly_desktop_&lt;startYYYYMM&gt;-&lt;endYYYYMM&gt;.json

3. **Monthly cumulative** - Monthly cumulative data is the sum of all mobile, and all desktop traffic per article. You should store the monthly cumulative data in a file called:
academy_monthly_cumulative_&lt;startYYYYMM&gt;-&lt;endYYYYMM&gt;.json

For all of the files the &lt;startYYYYMM&gt; and &lt;endYYYYMM&gt; represent the starting and ending year and month as integer text.
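
The two sums described above can be sketched with pandas on a tiny hand-made table. The counts below are made-up illustrative values, not real pageview data, and the names `df_toy`, `mobile`, and `total` are chosen for this example only. The sketch computes the per-month sum of mobile and desktop traffic described in item 3 and builds a file name from the pattern.

```python
import pandas as pd

# Made-up illustrative counts, not real pageview data
df_toy = pd.DataFrame([
    {'article': 'Example_Article', 'timestamp': '2023080100', 'access': 'mobile-app', 'views': 5},
    {'article': 'Example_Article', 'timestamp': '2023080100', 'access': 'mobile-web', 'views': 20},
    {'article': 'Example_Article', 'timestamp': '2023080100', 'access': 'desktop',    'views': 10},
    {'article': 'Example_Article', 'timestamp': '2023090100', 'access': 'mobile-app', 'views': 1},
    {'article': 'Example_Article', 'timestamp': '2023090100', 'access': 'mobile-web', 'views': 2},
    {'article': 'Example_Article', 'timestamp': '2023090100', 'access': 'desktop',    'views': 3},
])

keys = ['article', 'timestamp']

# Mobile: add mobile-app and mobile-web into one count per article and month
is_mobile = df_toy['access'].isin(['mobile-app', 'mobile-web'])
mobile = df_toy[is_mobile].groupby(keys, as_index=False)['views'].sum()

# Mobile plus desktop: add every access type per article and month
total = df_toy.groupby(keys, as_index=False)['views'].sum()

# File name pattern from the list above, with start and end as YYYYMM text
start, end = '201507', '202309'
mobile_filename = f"academy_monthly_mobile_{start}-{end}.json"

print(mobile)
print(total)
print(mobile_filename)
```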

In [8]:
df_views

| | project | article | granularity | timestamp | access | agent | views |
|---|---|---|---|---|---|---|---|
| 0 | en.wikipedia | Everything_Everywhere_All_at_Once | monthly | 2020010100 | mobile-app | user | 65 |
| 1 | en.wikipedia | Everything_Everywhere_All_at_Once | monthly | 2020020100 | mobile-app | user | 152 |
| 2 | en.wikipedia | Everything_Everywhere_All_at_Once | monthly | 2020030100 | mobile-app | user | 120 |
| 3 | en.wikipedia | Everything_Everywhere_All_at_Once | monthly | 2020040100 | mobile-app | user | 284 |
| 4 | en.wikipedia | Everything_Everywhere_All_at_Once | monthly | 2020050100 | mobile-app | user | 231 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | en.wikipedia | Zorba_the_Greek_(film) | monthly | 2023050100 | all-access | user | 13025 |
| 95 | en.wikipedia | Zorba_the_Greek_(film) | monthly | 2023060100 | all-access | user | 12631 |
| 96 | en.wikipedia | Zorba_the_Greek_(film) | monthly | 2023070100 | all-access | user | 20739 |
| 97 | en.wikipedia | Zorba_the_Greek_(film) | monthly | 2023080100 | all-access | user | 19522 |


- We create a function that performs the required aggregation for each access type and writes the result to a JSON file

In [9]:
def write_to_json(df, access_type):
    # Every column except 'views' identifies one entry of an article's monthly series
    key_cols = [c for c in df.columns if c != 'views']

    if access_type == 'cumulative':
        # Sum views per key, then take a running total over time within each article
        df = df.groupby(key_cols).sum().groupby('article').cumsum().reset_index()

    if access_type == 'mobile':
        # Add the mobile-app and mobile-web counts into one mobile count per article and month
        df = df.groupby(key_cols).agg({'views' : 'sum'}).reset_index()

    df = df.sort_values(by=['article', 'timestamp'], ascending=True)

    # to_json() returns a JSON array as a string; parse it so it can be pretty-printed
    parsed = json.loads(df.to_json(orient='records'))
    json_object = json.dumps(parsed, indent=4)

    data_folder = "data"
    filename = f"academy_monthly_{access_type}_201507-202309.json"
    path = os.path.join("..", data_folder, filename)
    with open(path, 'w') as f:
        f.write(json_object)
    print("Done for: ", access_type)

In [10]:
def output_json(df):
    # Use a fresh index on a new dataframe so the caller's dataframe is left unchanged
    # and the function can safely be called more than once
    df = df.reset_index(drop=True)
    is_mobile = df['access'].isin(['mobile-app', 'mobile-web'])
    write_to_json(df[df['access']=='desktop'].drop(columns='access'), 'desktop')
    write_to_json(df[is_mobile].drop(columns='access'), 'mobile')
    write_to_json(df[df['access']=='all-access'].drop(columns='access'), 'cumulative')

In [11]:
# Calling the function and passing the dataframe which contains the data as dumped by the API call.
output_json(df_views)

Done for:  desktop
Done for:  mobile
Done for:  cumulative
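
- The assignment text asks for JSON that uses article titles as keys. One possible way to reshape a list of records into that layout is sketched below. `group_by_article` is a hypothetical helper, the records are made-up placeholders, and the resulting layout is an assumption about what "keyed by article title" could look like, not a description of the files written above.

```python
import json
from collections import defaultdict

def group_by_article(records):
    """Hypothetical helper: map each article title to its list of monthly records."""
    grouped = defaultdict(list)
    for record in records:
        # Drop the 'access' field, as the assignment text asks
        entry = {k: v for k, v in record.items() if k != 'access'}
        grouped[record['article']].append(entry)
    # Order the titles alphabetically and each series by timestamp
    return {title: sorted(series, key=lambda r: r['timestamp'])
            for title, series in sorted(grouped.items())}

# Made-up placeholder records, not real pageview data
records = [
    {'article': 'Example_B', 'timestamp': '2023080100', 'access': 'desktop', 'views': 7},
    {'article': 'Example_A', 'timestamp': '2023090100', 'access': 'desktop', 'views': 3},
    {'article': 'Example_A', 'timestamp': '2023080100', 'access': 'desktop', 'views': 4},
]

keyed = group_by_article(records)
print(json.dumps(keyed, indent=4))
```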


### Please execute the Data_Analysis.ipynb file after this