# Data Acquisition

In this notebook, we only extract the pageview data for the Wikipedia pages listed in the file `rare-disease_cleaned.AUG.2024.csv`. Data acquisition and data analysis are kept separate to improve user experience and reproducibility. Data collection takes 30-40 minutes on its own, so keeping it separate lets users who are only interested in the analysis run that part without waiting.

## Setup and loading the necessary packages
Note: this analysis was done in Google Colab, a Jupyter Notebook environment that runs in the cloud. It reads the data files from your Google Drive, so the snippet below mounts the drive to make those files available.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Change to the required path where all the data files are located. This will differ for each user.

In [None]:
%cd 'drive/MyDrive/data 512'

/content/drive/MyDrive/data 512


Here, we load the necessary packages that we'll be using to fetch the data.

In [None]:
# These are standard python modules
import json, time, urllib.parse

# The 'requests' and 'pandas' modules are not standard Python modules.
# You will need to install them with pip/pip3 if you do not already have them
import requests
import pandas as pd

## Utility functions definition
The example relies on some constants that help make the code a bit more readable.

In [None]:
#########
#
#    CONSTANTS
#

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, so we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error
REQUEST_HEADERS = {
    'User-Agent': '<raaguln@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example, below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2024093000"    # last day of the final month collected (September 2024)
}

# We use the above params to make the params for the other access types
# We use Python's dictionary unpacking (`**`) to copy the desktop template
# and override only the `access` value.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILEAPP = {
    **ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP,
    "access":      "mobile-app",
}

ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILEWEB = {
    **ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP,
    "access":      "mobile-web",
}

ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_ALLACCESS = {
    **ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP,
    "access":      "all-access",
}
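As a rough illustration of how these constants fit together, the sketch below fills the format string from a template and prints the resulting request URL. The article title and the date window are sample values chosen for the example, not values read from the CSV, and no request is sent.

```python
import urllib.parse

# Same endpoint and format string as the constants above
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
params = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# Illustrative values only (assumed for this sketch)
template = {
    "project": "en.wikipedia.org",
    "access": "desktop",
    "agent": "user",
    "article": "",
    "granularity": "monthly",
    "start": "2015010100",
    "end": "2024093000",
}

# Unpacking builds a new dictionary, so the original template is left untouched
mobile_web = {**template, "access": "mobile-web"}

# Titles use "_" for spaces and are percent-encoded before being placed in the URL
title = urllib.parse.quote("Klinefelter syndrome".replace(" ", "_"), safe="")
example_url = endpoint + params.format(**{**mobile_web, "article": title})
print(example_url)
```

Only the `access` segment of the URL changes between the four templates; everything else is shared.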


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [None]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageviews_per_article(article_title = None,
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT,
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS,
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP,
                                  headers = REQUEST_HEADERS):

    # Work on a copy so that the shared template constants are never modified
    request_params = dict(request_template)

    # article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_params['article'] = article_title

    if not request_params['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    request_params['article'] = urllib.parse.quote(request_params['article'].replace(' ','_'), safe='')

    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


A successful call returns a dictionary whose `items` key holds a list of dictionaries, one per month, each containing the view count for that month.
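The function above returns `None` when the request fails, and an error payload would not contain an `items` key. The sketch below shows one way to unpack a response defensively. `extract_items` is a hypothetical helper, not part of the notebook; the sample record is hand-written with made-up numbers, and the error payload shape is an assumption.

```python
def extract_items(json_response):
    """Return the list of monthly records, or [] when the response is unusable."""
    if not isinstance(json_response, dict):
        return []
    items = json_response.get("items")
    return items if isinstance(items, list) else []

# Hand-written sample in the shape of a per-article response; the numbers are made up
sample_response = {
    "items": [
        {"project": "en.wikipedia", "article": "Klinefelter_syndrome",
         "granularity": "monthly", "timestamp": "2015070100",
         "access": "desktop", "agent": "user", "views": 1234},
    ]
}

print(extract_items(sample_response)[0]["views"])   # a usable response
print(extract_items(None))                          # request failed
print(extract_items({"detail": "not found"}))       # assumed error payload
```

Returning an empty list keeps downstream loops simple, at the cost of hiding which articles failed, so a real run would probably also want to log the failures.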

## Data fetching and cleaning
In this section, we load the rare diseases data and call the Wikimedia API to fetch the pageviews information.

In [None]:
# Use the pandas library to load the CSV file with the metadata for diseases
# which we use to make the API call
articles = pd.read_csv("rare-disease_cleaned.AUG.2024.csv")

In [None]:
# Exploratory analysis of the structure of data by looking at the first few entries
articles.head()

Unnamed: 0,disease,pageid,url
0,Klinefelter syndrome,19833554,https://en.wikipedia.org/wiki/Klinefelter_synd...
1,Aarskog–Scott syndrome,7966521,https://en.wikipedia.org/wiki/Aarskog–Scott_sy...
2,Abetalipoproteinemia,68451,https://en.wikipedia.org/wiki/Abetalipoprotein...
3,MT-TP,20945466,https://en.wikipedia.org/wiki/MT-TP
4,Ablepharon macrostomia syndrome,10776100,https://en.wikipedia.org/wiki/Ablepharon_macro...


We sort the `articles` data by disease name so that the final JSON produced at the end of this notebook is always in a predictable order, regardless of the order in which the diseases appear in the file `rare-disease_cleaned.AUG.2024.csv`.

In [None]:
# Sort the data based on the disease name
articles = articles.sort_values('disease')

The snippet below makes the API calls to get all four pageview series for each page: desktop, mobile app, mobile web and cumulative (all-access). It takes 30-40 minutes to run, so please wait for it to complete.

Note: we only fetch the data here and do not do any preprocessing, to keep the concerns separate. Each code block does one thing, which helps replicability and traceability. In this case, the block should fail only if there is an issue on the API side: either the API response format has changed, or the API is down. This makes it easy to debug and replicate.

In [None]:
# Declaring the necessary dictionaries for each client
views_desktop = {}
views_mobile_app = {}
views_mobile_web = {}
views_mobile_allaccess = {}

# Iterating through each row we have in the `articles` data
for i, row in articles.iterrows():
    disease = row['disease']

    # Pageviews for desktop
    response_desktop = request_pageviews_per_article(
        article_title=disease,
        request_template=ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP
    )
    # Our required data is stored as the value for the key `items`, which is
    # what we'll store directly as our required value
    views_desktop[disease] = response_desktop['items']

    # Pageviews for mobile app
    response_mobile_app = request_pageviews_per_article(
        article_title=disease,
        request_template=ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILEAPP
    )
    # Our required data is stored as the value for the key `items`, which is
    # what we'll store directly as our required value
    views_mobile_app[disease] = response_mobile_app['items']

    # Pageviews for mobile web
    response_mobile_web = request_pageviews_per_article(
        article_title=disease,
        request_template=ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILEWEB
    )
    # Our required data is stored as the value for the key `items`, which is
    # what we'll store directly as our required value
    views_mobile_web[disease] = response_mobile_web['items']

    # Cumulative pageviews for the disease
    response_allaccess = request_pageviews_per_article(
        article_title=disease,
        request_template=ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_ALLACCESS
    )
    # Our required data is stored as the value for the key `items`, which is
    # what we'll store directly as our required value
    views_mobile_allaccess[disease] = response_allaccess['items']
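The four near-identical request blocks above could also be driven by a mapping from access type to template. The sketch below illustrates that idea under stated assumptions: `collect_pageviews` is a hypothetical helper, and `fake_fetch` is a stand-in for `request_pageviews_per_article` so that the example runs offline. It also records failures instead of raising on a missing `items` key.

```python
def collect_pageviews(diseases, templates, fetch):
    """Call `fetch` once per (disease, access type) and gather the `items` lists."""
    views = {name: {} for name in templates}
    failed = []
    for disease in diseases:
        for name, template in templates.items():
            response = fetch(article_title=disease, request_template=template)
            if isinstance(response, dict) and "items" in response:
                views[name][disease] = response["items"]
            else:
                # Keep track of what could not be fetched instead of crashing
                failed.append((disease, name))
    return views, failed

# Stand-in for request_pageviews_per_article so the sketch runs without network access
def fake_fetch(article_title, request_template):
    if article_title == "Missing page":
        return {"detail": "not found"}
    return {"items": [{"access": request_template["access"], "views": 1}]}

templates = {"desktop": {"access": "desktop"}, "mobile_web": {"access": "mobile-web"}}
views, failed = collect_pageviews(["Abetalipoproteinemia", "Missing page"], templates, fake_fetch)
print(sorted(views["desktop"]))
print(failed)
```

In the notebook itself, `fetch` would be `request_pageviews_per_article` and `templates` would map names to the four `ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_*` constants.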

Since the `access` property for each month is misleading, we will remove it from all datasets.

In [None]:
# Iterate through all datasets and remove `access` from all views data
datasets = [views_desktop, views_mobile_app, views_mobile_web, views_mobile_allaccess]
for dataset in datasets:
    for disease_pageviews in dataset.values():
        for month in disease_pageviews:
            # Each entry is a dictionary, so `pop` mutates it in place and
            # removes the `access` key (the `None` default avoids a KeyError)
            month.pop('access', None)


Since we need a single dataset for all mobile views, we will add up both app and web views into a single dictionary called `views_mobile`.

The second part of the code snippet is a sanity check on the final `views` data: it confirms that each monthly mobile total equals the sum of the corresponding app and web values. Python's `+` operator also works on other data types (strings, lists), so a change in the API response structure could produce wrong totals without raising an error. Note that the equality check alone does not verify that `views` is numeric.

In [None]:
# Sum up views from both app and web for mobile, and verify if they added up
views_mobile = {}

for disease in views_mobile_app.keys():
    total_pageviews = []
    # Iterate through both app and web data for mobile
    for month_app, month_web in zip(views_mobile_app[disease], views_mobile_web[disease]):
        # Since app and web data are the same for all properties except `views`,
        # we take the app record (unpacking its values into a new dictionary)
        # and override only the `views` value with the summed total
        total_pageviews.append({
            **month_app,
            'views': month_app['views'] + month_web['views']
        })
    views_mobile[disease] = total_pageviews

# Check if the views got summed up properly
for disease, total_data in views_mobile.items():
    app_data = views_mobile_app[disease]
    web_data = views_mobile_web[disease]
    for i in range(len(total_data)):
        if total_data[i]['views'] != app_data[i]['views'] + web_data[i]['views']:
            raise Exception("The pageviews don't add up!")
# This code block gets executed only if the entire for loop ran successfully.
else:
    print("All good! Mobile pageviews add up!")

All good! Mobile pageviews add up!
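The `zip` above pairs app and web records by position, which assumes both lists cover the same months in the same order. Whether the real data has gaps has not been checked here. As a sketch of a more defensive alternative, the hypothetical helper below sums by `timestamp` and rejects non-integer `views` values. The sample records are made up.

```python
def sum_mobile_views(app_months, web_months):
    """Combine app and web records by `timestamp`, requiring integer view counts."""
    totals = {}
    for month in list(app_months) + list(web_months):
        views = month["views"]
        # bool is a subclass of int, so it is excluded explicitly
        if not isinstance(views, int) or isinstance(views, bool):
            raise TypeError(f"Non-numeric views value: {views!r}")
        # First sighting of a month starts a fresh record with a zero count
        record = totals.setdefault(month["timestamp"], {**month, "views": 0})
        record["views"] += views
    # YYYYMMDDHH timestamps sort chronologically as strings
    return [totals[ts] for ts in sorted(totals)]

# Made-up sample where the app series is missing its first month
app = [{"timestamp": "2015080100", "views": 5}]
web = [{"timestamp": "2015070100", "views": 10}, {"timestamp": "2015080100", "views": 20}]
combined = sum_mobile_views(app, web)
print(combined)
```

With positional pairing, the sample above would have added the August app count to the July web count. Keying on `timestamp` avoids that kind of misalignment.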


## Writing the final output to JSON files
In this section, we write our final datasets to JSON files (easy to read and consume) using the built-in `json` package. The output files are written to the current working directory, which here is the Drive folder selected earlier. If the notebook is run locally, they are written to the folder the notebook is run from. Modify the paths to store them elsewhere.

In [None]:
# Utility function to write the data to JSON
def write_to_json(filename, data):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)
        print(f"{filename} created successfully!")

# This writes the files to the current working directory. If you want to
# write them elsewhere, include the right path in the filename.
write_to_json('rare-disease_monthly_mobile_201501-202409.json', views_mobile)
write_to_json('rare-disease_monthly_desktop_201501-202409.json', views_desktop)
write_to_json('rare-disease_monthly_cumulative_201501-202409.json', views_mobile_allaccess)

rare-disease_monthly_mobile_201501-202409.json created successfully!
rare-disease_monthly_desktop_201501-202409.json created successfully!
rare-disease_monthly_cumulative_201501-202409.json created successfully!
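A quick way to confirm that a file was written correctly is to load it back and compare it with the in-memory data. The sketch below does this with a made-up sample and a temporary directory, so it does not touch the real output files. `write_to_json` is repeated here (without the print) only to keep the example self-contained.

```python
import json
import os
import tempfile

def write_to_json(filename, data):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)

# Made-up sample in the same shape as the notebook's output dictionaries
sample = {"Abetalipoproteinemia": [{"timestamp": "2015070100", "views": 10}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_pageviews.json")
    write_to_json(path, sample)
    # Read the file back and compare it with the original data
    with open(path) as f:
        reloaded = json.load(f)

print(reloaded == sample)
```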
