## Imports and Dependencies

This section imports the Python libraries the notebook relies on. They fall into two groups: modules from the standard library and third-party packages that must be installed separately.

### Standard Python Libraries:
- **json**: Reads and writes JSON data.
- **time**: Provides time-related functions; used here to pause between API requests.
- **urllib.parse**: Encodes article titles so they can be placed in URLs.
- **datetime**: Handles date and time data, particularly useful for working with timestamps.

### External Libraries:
- **requests**: Makes HTTP requests to fetch data from web APIs (install with `pip install requests`).
- **pandas**: A data manipulation and analysis library (install with `pip install pandas`).
- **matplotlib.pyplot**: A data visualization library for creating plots and charts (install with `pip install matplotlib`).

Ensure that all external libraries are installed before running the notebook.


In [4]:
# These are standard Python modules
import json, time, urllib.parse
from datetime import datetime


# The following modules are not part of the standard library. Install them with pip/pip3 if you do not already have them
import requests
import pandas as pd
import matplotlib.pyplot as plt

## Constants and API Setup

This section defines constants and parameters for interacting with the Wikimedia Pageviews API.

### API URL and Parameters:
- **`API_REQUEST_PAGEVIEWS_ENDPOINT`**: Base URL for Wikimedia 'pageviews' requests.
- **`API_REQUEST_PER_ARTICLE_PARAMS`**: Template string for 'per-article' API requests, allowing customization of project, access type, article, date range, etc.

### Rate Limiting:
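- **`API_LATENCY_ASSUMED`** & **`API_THROTTLE_WAIT`**: Together define the pause before each request. The wait is 1/100 of a second minus an assumed 2 ms of latency (about 8 ms), which keeps the notebook under the 100 requests/second that Wikimedia asks clients not to exceed.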

### User-Agent:
- **`REQUEST_HEADERS`**: Includes contact details to comply with API requirements.

### Article List:
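- **`ARTICLE_TITLES`**: Example English Wikipedia titles such as 'Bison' and 'Chinook salmon' that can be used for test requests. The data collection later in the notebook reads its titles from a CSV file instead.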

### API Template:
- **`ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE`**: Template dictionary holding one value per API parameter. The date range is fixed (July 2015 through September 2024) and the article name is filled in before each request.


In [5]:
#########
#
#    CONSTANTS
#

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, so we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# Wikimedia asks that requests to its API include your email address so that they can
# contact you if something goes wrong, such as your code exceeding rate limits
REQUEST_HEADERS = {
    'User-Agent': '<tbaner@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example below, we only vary the article name and access type, so the majority
# of the fields can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015070100",
    "end":         "2024093000" 
}
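
As a rough illustration of how these constants combine, the sketch below fills the per-article template for a single title and computes the throttle delay. The values are copied from the cell above so the snippet runs on its own; the names `endpoint`, `per_article`, `params`, `url`, and `wait` are used only for this illustration.

```python
# Values copied from the constants cell so this sketch is self-contained
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
per_article = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

params = {
    "project": "en.wikipedia.org",
    "access": "desktop",
    "agent": "user",
    "article": "Bison",
    "granularity": "monthly",
    "start": "2015070100",
    "end": "2024093000",
}

# str.format(**params) replaces each {name} placeholder with the matching dictionary value
url = endpoint + per_article.format(**params)
print(url)

# 1/100 s per request stays at the limit; subtracting the assumed 2 ms latency gives the sleep time
wait = (1.0 / 100.0) - 0.002
print(round(wait, 3))
```

Because every parameter lives in one dictionary, switching the access type or date range only requires changing a single value before formatting.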

## Procedures and Functions

This section outlines the function that retrieves pageview data for individual Wikipedia articles using the Wikimedia Pageviews API.

### `request_pageviews_per_article()`

This function requests the monthly pageviews for a specified Wikipedia article. It takes the article title and an access type (`desktop`, `mobile-web`, `mobile-app`, or `all-access`) and builds the full request URL from the predefined constants. Its key components are:

- **Function Parameters**:
  - `article_title`: The title of the Wikipedia article (default is `None`). The function raises an exception if no article is provided.
  - `access_type`: Specifies how the article was accessed (`desktop`, `mobile-web`, `mobile-app`, or `all-access`; default is "desktop").
  - `endpoint_url`: Base URL of the Wikimedia Pageviews API.
  - `endpoint_params`: Template for the specific API request parameters.
  - `request_template`: A dictionary containing request parameters, which are customized for each article.
  - `headers`: HTTP headers, including User-Agent information, to identify the requester.

- **Article Title Handling**:
  - The function checks if an article title is provided. If so, it updates the `article` key in the `request_template` with the provided title.
  - The article title is URL-encoded to handle spaces and special characters before being formatted into the request URL.

- **Access Type**:
  - The `access_type` (e.g., desktop, mobile-web) is set in the request template. This helps filter the pageviews based on how the article was accessed.

- **URL Construction**:
  - The function dynamically formats the URL by inserting values from the `request_template` into the API request template string. This constructs the final URL for the API request.

- **Throttling**:
  - To comply with the Wikimedia API’s rate limits, the function includes a small delay (`API_THROTTLE_WAIT`) before making each request. This ensures we don’t exceed the maximum allowed requests per second.

- **Making the API Request**:
  - The function uses `requests.get()` to send the API request. It includes error handling to catch and report exceptions during the request process.

- **Response Handling**:
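  - The function returns the parsed JSON body of the response. On success this contains the pageview data under `items`; an error response from the API (for example, for an unknown article) is typically still JSON but lacks `items`. If an exception is raised during the request or while parsing, the function prints it and returns `None`.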

The function can be used in a loop to request pageview data for multiple articles by passing different article titles into the function.
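
Because `request_template` defaults to a module-level dictionary, any in-place change made inside a function is visible to later calls. The sketch below uses a cut-down template and two hypothetical helpers, `fill_in_place` and `fill_copy` (neither is part of the notebook), to show the difference between editing the shared dictionary and editing a copy.

```python
import urllib.parse

TEMPLATE = {"article": "", "access": "desktop"}  # cut-down stand-in for the real template

def fill_in_place(title, template=TEMPLATE):
    # Modifies the shared dictionary: the encoded title persists after the call
    template["article"] = urllib.parse.quote(title.replace(" ", "_"))
    return template

def fill_copy(title, template=TEMPLATE):
    # Copies first, so the shared dictionary keeps its original values
    filled = dict(template)
    filled["article"] = urllib.parse.quote(title.replace(" ", "_"))
    return filled

filled = fill_copy("Chinook salmon")
print(filled["article"])    # Chinook_salmon
print(TEMPLATE["article"])  # still an empty string

fill_in_place("Chinook salmon")
print(TEMPLATE["article"])  # Chinook_salmon: the shared template has changed
```

When every call passes an explicit title, as the loop in this notebook does, both approaches produce the same URLs; the copy simply avoids carrying state from one call to the next.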


In [6]:
#########
#
#    PROCEDURES/FUNCTIONS
#

# Function to request pageviews per article
def request_pageviews_per_article(article_title=None, 
                                  access_type="desktop",
                                  endpoint_url=API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params=API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template=ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers=REQUEST_HEADERS):

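    # Work on a copy so the shared template (a default argument) is not modified between calls
    request_template = dict(request_template)

    # Set the article title
    if article_title:
        request_template['article'] = article_title

    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Set the access type
    request_template['access'] = access_type
    
    # Encode the article title for URL: spaces become underscores, then special characters are percent-encoded
    # NOTE: quote() leaves '/' unescaped by default (safe='/'), so a title containing a slash is not fully encoded here
    article_title_encoded = urllib.parse.quote(request_template['article'].replace(' ','_'))
    request_template['article'] = article_title_encoded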
    
    # Create the request URL
    request_url = endpoint_url + endpoint_params.format(**request_template)
    
    # Make the request and handle exceptions
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


## Loading Article Titles and Data Initialization

In this section, we load article titles from a CSV file, which contains a column labeled 'disease' representing Wikipedia page titles. These titles will be used to request pageview data from the Wikimedia API. Additionally, we initialize empty lists to store the pageview data for desktop, mobile, and cumulative views.

### Steps:
- **Loading the CSV File**:
  - We use `pandas` to load the CSV file `rare-disease_cleaned.AUG.2024.csv` and extract the 'disease' column into a list of article titles.

- **Data Initialization**:
  - We initialize three empty lists: `desktop_data`, `mobile_data`, and `cumulative_data`. The first two store the pageview data for each article by access type (desktop and mobile). `cumulative_data` stores the totals across all access types, which are requested from the API separately.

This setup allows us to loop through each article title and request pageview data using the `request_pageviews_per_article()` function.


In [7]:
# Load the article titles from the provided CSV file (column: 'disease')
df_pages = pd.read_csv("../data/rare-disease_cleaned.AUG.2024.csv")
pages = df_pages['disease'].tolist()  # Convert the 'disease' column to a list of article titles

# Initialize empty lists to store the collected data for desktop, mobile, and cumulative views
desktop_data = []
mobile_data = []
cumulative_data = []
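
A minimal sketch of the loading step, using an in-memory CSV with made-up rows in place of the real file. It assumes only that a `disease` column exists; the sample titles and the names `sample_csv`, `df_sample`, and `titles` are hypothetical. The `dropna()` call is an optional extra that skips empty cells.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the CSV file; only the 'disease' column is assumed
sample_csv = io.StringIO("disease\nCystic fibrosis\nHuntington's disease\n")

df_sample = pd.read_csv(sample_csv)

# Drop empty cells, then convert the column to a plain Python list of titles
titles = df_sample['disease'].dropna().tolist()
print(titles)
```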

## Fetching Pageview Data for Each Article

This section iterates over each article title in the `pages` list and requests pageview data from the Wikimedia API. Data is collected in three groups: desktop, mobile (web + app), and cumulative (all access types). For each article, the loop appends the results to the corresponding list.

### Steps:
1. **Desktop Views**:
   - Calls the `request_pageviews_per_article()` function to retrieve pageviews for desktop access.
   - If data is available, it loops through each monthly entry and appends the article title, timestamp, and views to the `desktop_data` list.
   - If no data is found for desktop, it prints a message indicating no data for that article.

2. **Mobile Views (mobile-web + mobile-app)**:
   - Fetches pageviews separately for mobile-web and mobile-app access types.
   - The two responses are paired month by month; where the timestamps match, the views are summed and the combined value is stored in the `mobile_data` list.
   - If no data is found for either mobile-web or mobile-app, a message is printed.

3. **Cumulative Views (All Access Types)**:
   - Retrieves cumulative pageviews across all access types.
   - Each entry is stored in the `cumulative_data` list, including the article title, timestamp, and views.
   - If no cumulative data is found, a message is printed for that article.

This loop ensures that all access types are covered and that each request handles missing data gracefully by printing appropriate error messages.
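
The steps above can be sketched without network access by working on dictionaries shaped like the API's response: an `items` list whose entries carry `timestamp` and `views`. The helper names `to_records` and `sum_by_timestamp` and the sample numbers are hypothetical. Unlike pairing by position, this variant matches mobile-web and mobile-app months by timestamp, so a month missing from one response does not shift the pairing.

```python
def to_records(title, response):
    """Flatten one response into a list of {article_title, timestamp, views} rows."""
    if not response or 'items' not in response:
        return []
    return [{"article_title": title, "timestamp": m['timestamp'], "views": m['views']}
            for m in response['items']]

def sum_by_timestamp(title, web_response, app_response):
    """Add mobile-web and mobile-app views for months present in both responses."""
    app_views = {m['timestamp']: m['views'] for m in (app_response or {}).get('items', [])}
    rows = []
    for m in (web_response or {}).get('items', []):
        if m['timestamp'] in app_views:
            rows.append({"article_title": title,
                         "timestamp": m['timestamp'],
                         "views": m['views'] + app_views[m['timestamp']]})
    return rows

# Made-up sample responses; the app response lacks the first month
web = {"items": [{"timestamp": "2015070100", "views": 10},
                 {"timestamp": "2015080100", "views": 20}]}
app = {"items": [{"timestamp": "2015080100", "views": 5}]}

print(to_records("Bison", web))
print(sum_by_timestamp("Bison", web, app))
```

Months that appear in only one of the two responses are dropped in this sketch; whether to keep them with the other value treated as zero is a design choice that depends on the analysis.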


In [8]:
# Iterate over each article and fetch pageview data for different access types
for page in pages:
    # Fetch desktop views
    desktop_views = request_pageviews_per_article(article_title=page, access_type="desktop")
    if desktop_views and 'items' in desktop_views:
        for month in desktop_views['items']:
            month_data = {
                "article_title": page,
                "timestamp": month['timestamp'],
                "views": month['views']
            }
            desktop_data.append(month_data)
    else:
        print(f"No data for {page} (desktop)")

    # Fetch mobile views (sum mobile-web and mobile-app)
    mobile_web_views = request_pageviews_per_article(article_title=page, access_type="mobile-web")
    mobile_app_views = request_pageviews_per_article(article_title=page, access_type="mobile-app")
    
    if mobile_web_views and 'items' in mobile_web_views and mobile_app_views and 'items' in mobile_app_views:
        for web_month, app_month in zip(mobile_web_views['items'], mobile_app_views['items']):
            if web_month['timestamp'] == app_month['timestamp']:  # Ensure the timestamps match
                month_data = {
                    "article_title": page,
                    "timestamp": web_month['timestamp'],
                    "views": web_month['views'] + app_month['views']  # Sum mobile-web and mobile-app views
                }
                mobile_data.append(month_data)
    else:
        print(f"No data for {page} (mobile-web or mobile-app)")

    # Fetch cumulative views (all-access)
    cumulative_views = request_pageviews_per_article(article_title=page, access_type="all-access")
    if cumulative_views and 'items' in cumulative_views:
        for month in cumulative_views['items']:
            month_data = {
                "article_title": page,
                "timestamp": month['timestamp'],
                "views": month['views']
            }
            cumulative_data.append(month_data)
    else:
        print(f"No data for {page} (all-access)")


No data for Sulfadoxine/pyrimethamine (desktop)
No data for Sulfadoxine/pyrimethamine (mobile-web or mobile-app)
No data for Sulfadoxine/pyrimethamine (all-access)
No data for Cystine/glutamate transporter (desktop)
No data for Cystine/glutamate transporter (mobile-web or mobile-app)
No data for Cystine/glutamate transporter (all-access)
No data for Trimethoprim/sulfamethoxazole (desktop)
No data for Trimethoprim/sulfamethoxazole (mobile-web or mobile-app)
No data for Trimethoprim/sulfamethoxazole (all-access)
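
All three titles reported above contain a `/`. One plausible explanation, offered as an assumption rather than a confirmed diagnosis, is that `urllib.parse.quote` treats `/` as safe by default and leaves it unescaped, so the title ends up split across two path segments of the request URL. Passing `safe=''` percent-encodes the slash as well:

```python
import urllib.parse

title = 'Sulfadoxine/pyrimethamine'

# Default behaviour: '/' is considered safe and is left as-is
default_encoded = urllib.parse.quote(title.replace(' ', '_'))

# With safe='' every reserved character, including '/', is percent-encoded
fully_encoded = urllib.parse.quote(title.replace(' ', '_'), safe='')

print(default_encoded)  # Sulfadoxine/pyrimethamine
print(fully_encoded)    # Sulfadoxine%2Fpyrimethamine
```

Whether the API then returns data for these three titles has not been verified here.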


## Converting Collected Data into DataFrames

After collecting the pageview data for desktop, mobile, and cumulative views, the data is converted into `pandas` DataFrames for easier manipulation and analysis.

### Steps:
1. **Desktop Data**:
   - The list `desktop_data`, which contains pageviews for desktop access, is converted into a DataFrame `df_desktop`.

2. **Mobile Data**:
   - The list `mobile_data`, containing combined mobile-web and mobile-app views, is converted into a DataFrame `df_mobile`.

3. **Cumulative Data**:
   - The `cumulative_data` list, storing total pageviews across all access types, is converted into a DataFrame `df_cumulative`.

These DataFrames will allow for efficient data manipulation, analysis, and visualization in subsequent steps.


In [9]:
# Convert the collected data into DataFrames
df_desktop = pd.DataFrame(desktop_data)
df_mobile = pd.DataFrame(mobile_data)
df_cumulative = pd.DataFrame(cumulative_data)
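
The `timestamp` values collected above are strings in `YYYYMMDDHH` form, the same form as the `start` and `end` request parameters. If real dates are wanted later, for plotting for example, one option is to parse them with `datetime.strptime`. A small sketch on made-up rows (the `rows` list and the added `date` key are illustrative only):

```python
from datetime import datetime

rows = [
    {"article_title": "Bison", "timestamp": "2015070100", "views": 10},
    {"article_title": "Bison", "timestamp": "2015080100", "views": 20},
]

# %Y%m%d%H matches a four-digit year, then month, day, and hour with no separators
for row in rows:
    row["date"] = datetime.strptime(row["timestamp"], "%Y%m%d%H")

print(rows[0]["date"])
```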

## Saving the Results to JSON Files

After converting the collected pageview data into DataFrames, the data is saved to JSON files for future use or sharing. Each DataFrame (desktop, mobile, cumulative) is exported as a separate JSON file.

### Steps:
1. **Desktop Data**:
   - The `df_desktop` DataFrame is saved as `rare-disease_monthly_desktop_201507-202409.json` in the `../data/` directory, with one record per row and 4-space indentation for readability.

2. **Mobile Data**:
   - The `df_mobile` DataFrame is saved as `rare-disease_monthly_mobile_201507-202409.json`, using the same record layout and indentation.

3. **Cumulative Data**:
   - The `df_cumulative` DataFrame is saved as `rare-disease_monthly_cumulative_201507-202409.json` in the same structured format.

These JSON files contain the monthly pageview data across different access types.


In [12]:
# Save the results to separate JSON files
df_desktop.to_json("../data/rare-disease_monthly_desktop_201507-202409.json", orient='records', indent=4)
df_mobile.to_json("../data/rare-disease_monthly_mobile_201507-202409.json", orient='records', indent=4)
df_cumulative.to_json("../data/rare-disease_monthly_cumulative_201507-202409.json", orient='records', indent=4)
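
With `orient='records'`, each file is a JSON array containing one object per row. The sketch below uses a made-up two-row string in that layout to show how such contents could be read back with the standard `json` module; the values and the names `sample`, `records`, and `total_views` are illustrative only.

```python
import json

# Made-up sample in the records layout: a list of row objects
sample = '''[
    {"article_title": "Bison", "timestamp": "2015070100", "views": 10},
    {"article_title": "Bison", "timestamp": "2015080100", "views": 20}
]'''

records = json.loads(sample)

# Each record is a plain dictionary, so ordinary Python tools apply
total_views = sum(r["views"] for r in records)
print(len(records), total_views)
```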