# XDMoD Data Analytics Framework — Introductory Notebook

Document version 3 (2025-07-18)

Compatible with XDMoD Data Analytics Framework v≥1.0.0 and <2.0.0

© 2023–2025 University at Buffalo Center for Computational Research

See the [xdmod-notebooks](https://github.com/ubccr/xdmod-notebooks) repository for information on setup, support, contributing, licensing, and referencing.

## Introduction

The XDMoD Data Analytics Framework provides API access to the data in an XDMoD portal via the [xdmod_data](https://pypi.org/project/xdmod-data) Python package. This notebook provides an introductory explanation of how to use the package. You will use the XDMoD API to request data, load them into [Pandas](https://pandas.pydata.org/) [DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), and generate plots.

The XDMoD Data Analytics Framework can be run either in an XDMoD-hosted JupyterHub (e.g., by clicking the "JupyterLab" button in ACCESS XDMoD) or locally on your machine.

## Install/upgrade the required modules

Run the code below to install/upgrade the packages needed to run this notebook.

In [None]:
import sys
! {sys.executable} -m pip install --upgrade 'xdmod-data[report]>=1.0.0,<2.0.0' python-dotenv tabulate

If running that code installed or upgraded Plotly, you may need to refresh your browser window for plots to appear correctly.

## Configure notebook formatting

### Exceptions

Run the code below to simplify how Python exceptions are displayed in this notebook.

In [None]:
import sys
def exception_handler(exception_type, exception, traceback):
    print("%s: %s" % (exception_type.__name__, exception), file=sys.stderr)
get_ipython()._showtraceback = exception_handler

### Tables

Run the code below to define a helper function that displays Pandas DataFrames as Markdown tables in this notebook.

In [None]:
from IPython.display import display, Markdown
def display_df_md_table(df):
    return display(Markdown(df.replace('\n', '<br/>', regex=True).to_markdown()))

### Plots

Run the code below to set up the external Plotly package to make plots using a custom XDMoD theme.

In [None]:
import plotly.express as px
import plotly.io as pio
import xdmod_data.themes
pio.templates.default = 'timeseries'

## Prepare to authenticate with XDMoD

If you are running this notebook in an XDMoD-hosted JupyterHub (e.g., you clicked the "JupyterLab" button in ACCESS XDMoD), then authentication happens automatically and you can skip this section.

Otherwise, if you are running this notebook in a different Jupyter environment, you will need to obtain an API token from the XDMoD portal by following [these instructions](https://github.com/ubccr/xdmod-data#api-token-access). Save the token to a file that the Jupyter environment can access (e.g., `~/xdmod-data.env` in your home directory) with the contents `XDMOD_API_TOKEN=token`, replacing `token` with your token. This file should be saved with `600` permissions (user read/write only). Then uncomment the last line of the code cell below and run it. It reads your token from `~/xdmod-data.env` into the environment, where it will be used later when you start running methods from the API, and it prints `True` if the file was loaded successfully.

In [None]:
from dotenv import load_dotenv
from os.path import expanduser
from pathlib import Path
#load_dotenv(Path(expanduser('~/xdmod-data.env')), override=True)
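
If you would rather create the token file from Python than by hand, the sketch below shows one way to write it with `600` permissions. The `write_token_file` helper is hypothetical (it is not part of `xdmod_data` or `python-dotenv`), the example assumes a POSIX system, and it writes a placeholder token to a temporary directory rather than to `~/xdmod-data.env`.

```python
import os
import stat
import tempfile
from pathlib import Path


def write_token_file(path, token):
    """Write XDMOD_API_TOKEN=<token> to path with 600 permissions.

    Hypothetical helper for illustration; not part of xdmod_data.
    """
    path = Path(path)
    # Create the file with mode 0o600 from the start so the token is never
    # readable by other users, even briefly.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, 'w') as f:
        f.write(f'XDMOD_API_TOKEN={token}\n')
    # If the file already existed with looser permissions, tighten them.
    os.chmod(path, 0o600)
    return path


# Demonstrate in a temporary directory with a placeholder token.
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = write_token_file(Path(tmp_dir) / 'xdmod-data.env', 'token')
    demo_contents = env_path.read_text()
    demo_mode = stat.S_IMODE(env_path.stat().st_mode)
print(demo_contents.strip(), oct(demo_mode))
```

To use it for real, you would pass `Path.home() / 'xdmod-data.env'` and your actual token instead of the placeholder values.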

## Initialize the DataWarehouse object

Run the code below to initialize a `DataWarehouse` object that will be used for making the API calls.

If you are running in an XDMoD-hosted JupyterHub, this object will make requests to the same XDMoD portal that is hosting the JupyterHub. To make requests to a different portal instead, you can specify the URL of that portal as a string parameter to the `DataWarehouse` constructor.

Otherwise, if you are running in a different Jupyter environment, you will need to specify the URL of the XDMoD portal as a string parameter to the `DataWarehouse` constructor (or set the `XDMOD_HOST` environment variable).

In [None]:
from xdmod_data.warehouse import DataWarehouse
dw = DataWarehouse()

## Get the data

Run the code below to use the `get_data()` method to request data from XDMoD and load them into a Pandas DataFrame. This example gets the number of active users of ACCESS-allocated resources over a 4-month period. Each of the parameters of the method will be explained later in this notebook. Use `with` to create a runtime context; this is also explained later in this notebook.

In [None]:
with dw:
    df = dw.get_data(
        duration=('2023-01-01', '2023-04-30'),
        realm='Jobs',
        metric='Number of Users: Active',
    )
display(df)

Note that the `df` object is a Pandas DataFrame:

In [None]:
type(df)

## Plot the data

In [None]:
plot = px.line(df, y='Number of Users: Active')
plot.show()

## Do further data processing

You can do further processing on the DataFrame to produce analysis and plots beyond those that are available in the XDMoD portal.

Run the code below to add a column for the day of the week:

In [None]:
df['Day Name'] = df.index.strftime('%a')
display(df)

Run the code below to show a box plot of the data grouped by day of the week:

In [None]:
plot = px.box(
    df,
    x='Day Name',
    y='Number of Users: Active',
    category_orders={'Day Name': ('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')},
)
plot.show()
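
To see the numbers behind a box plot like the one above, you can also summarize the data by day of the week directly in Pandas. The sketch below uses a small synthetic DataFrame (the values are made up and only stand in for a `get_data()` result) and computes the median per weekday.

```python
import pandas as pd

# Synthetic daily values standing in for the result of get_data().
index = pd.date_range('2023-01-01', '2023-01-14', freq='D', name='Time')
demo_df = pd.DataFrame(
    {'Number of Users: Active': [10, 50, 55, 60, 58, 52, 12,
                                 14, 48, 57, 62, 56, 50, 16]},
    index=index,
)
demo_df['Day Name'] = demo_df.index.strftime('%a')

# Group by weekday and put the result in calendar order rather than
# the alphabetical order that groupby produces.
day_order = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
medians = (
    demo_df.groupby('Day Name')['Number of Users: Active']
    .median()
    .reindex(day_order)
)
print(medians)
```

The `reindex` call plays the same role as `category_orders` in the Plotly call: it fixes the weekday order explicitly.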

## Details of the `get_data()` method

Now that you have seen a basic example of using the `get_data()` method, read below for more details on how it works.

### Wrap data warehouse calls in a runtime context

XDMoD data are accessed over a network connection, which involves establishing connections and creating temporary resources. To ensure these connections and resources are cleaned up properly even if a runtime error occurs, call data warehouse methods within a **runtime context** by wrapping them in Python's `with` statement. Inside the `with` block, run the XDMoD queries and store their results; run any long-running calculations outside of the runtime context, as in the template below.

In [None]:
with dw:
    # XDMoD queries would go here
    pass
# Data processing would go here
pass
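
To illustrate why the `with` statement guarantees cleanup, the sketch below defines a hypothetical `DemoWarehouse` class. It is not the real `DataWarehouse` and makes no network calls; its behavior is an assumption chosen to show the general context-manager pattern, in which `__exit__` runs even when the body raises an exception.

```python
class DemoWarehouse:
    """Hypothetical stand-in for a data warehouse client; not xdmod_data."""

    def __init__(self):
        self.connected = False

    def __enter__(self):
        # Acquire the connection when the `with` block starts.
        self.connected = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Release the connection when the block ends, error or not.
        self.connected = False
        return False  # do not suppress the exception

    def query(self, fail=False):
        if not self.connected:
            raise RuntimeError('query called outside the runtime context')
        if fail:
            raise ValueError('simulated query failure')
        return [1, 2, 3]


demo_dw = DemoWarehouse()
try:
    with demo_dw:
        demo_dw.query(fail=True)
except ValueError as error:
    print('caught:', error)
print('still connected?', demo_dw.connected)
```

Even though the query fails, the connection flag is reset, which is the property the runtime context is meant to provide.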

### Default parameters

The `get_data()` method has a number of parameters; their default values are shown below, and the parameters are explained in more detail further below.

In [None]:
with dw:
    df = dw.get_data(
        duration='Previous month',
        realm='Jobs',
        metric='CPU Hours: Total',
        dimension='None',
        filters={},
        dataset_type='timeseries',
        aggregation_unit='Auto',
    )

### Duration

The **duration** provides the time constraints of the data to be fetched from the XDMoD data warehouse.

As already seen, you can specify the duration as start and end times:

In [None]:
with dw:
    df = dw.get_data(duration=('2023-01-01', '2023-04-30'))
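
If you want a start/end duration relative to a given date rather than hard-coded strings, you can build the tuple with the standard library. The `last_n_days` helper below is hypothetical, and it assumes that both endpoints of the duration are inclusive and that dates use the `YYYY-MM-DD` format shown above.

```python
from datetime import date, timedelta


def last_n_days(n, end=None):
    """Return a (start, end) tuple of 'YYYY-MM-DD' strings covering n days.

    Hypothetical helper; `end` defaults to today.
    """
    if n < 1:
        raise ValueError('n must be at least 1')
    end = end or date.today()
    # Subtract n - 1 days so that the range includes both endpoints.
    start = end - timedelta(days=n - 1)
    return (start.isoformat(), end.isoformat())


demo_duration = last_n_days(120, end=date(2023, 4, 30))
print(demo_duration)
```

The resulting tuple could then be passed as the `duration` argument of `get_data()`.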

You can instead specify the duration using a special string value; a list of the valid values can be obtained by calling the `get_durations()` method.

In [None]:
with dw:
    durations = dw.get_durations()
display(durations)

### Realm

A **realm** is a category of data in the XDMoD data warehouse. You can use the `describe_realms()` method to get a DataFrame containing the list of available realms.

In [None]:
with dw:
    realms = dw.describe_realms()
display_df_md_table(realms)

### Metric

A **metric** is a statistic for which data exist in a given realm. You can use the `describe_metrics(realm)` method to get a DataFrame containing the list of valid metrics in the given realm. The realm must be passed in as a string.

In [None]:
with dw:
    metrics = dw.describe_metrics('Jobs')
display_df_md_table(metrics)

### Dimension

A **dimension** is a grouping of data. You can use the `describe_dimensions(realm)` method to get a DataFrame containing the list of valid dimensions in the given realm. The realm must be passed in as a string.

In [None]:
with dw:
    dimensions = dw.describe_dimensions('Jobs')
display_df_md_table(dimensions)

The code below shows how to get data grouped by the `Resource` dimension and plot them.

In [None]:
metric_label = 'Number of Users: Active'
with dw:
    df = dw.get_data(
        duration=('2023-01-01', '2023-04-30'),
        realm='Jobs',
        metric=metric_label,
        dimension='Resource',
    )
plot = px.line(df, labels={'value': metric_label})
plot.show()

### Pass in realms, metrics, and dimensions using labels or IDs

For methods in the API that take realms, metrics, and/or dimensions as arguments, you can pass them in as their labels or their IDs.

In [None]:
with dw:
    df = dw.get_data(
        duration='10 year',
        realm='Allocations',
        metric='NUs: Allocated', # 'allocated_nu' also works
        dimension='Resource Type',  # 'resource_type' also works
    )

### Filters

**Filters** allow you to include only data that have certain values for given dimensions. You can use the `get_filter_values(realm, dimension)` method to get a DataFrame containing the list of valid filter values for the given dimension in the given realm. The realm and dimension must be passed in as strings.

In [None]:
with dw:
    filter_values = dw.get_filter_values('Jobs', 'Resource') # 'resource' also works
display_df_md_table(filter_values)

For methods in the API that take filters as arguments, you must specify the filters as a dictionary in which the keys are dimensions (labels or IDs) and the values are string filter values (labels or IDs) or sequences of string filter values. For example, to return only data for which the field of science is biophysics and the resource is either NCSA Delta GPU or TACC Stampede2:

In [None]:
with dw:
    df = dw.get_data(
        filters={
            'Field of Science': 'Biophysics', # 'fieldofscience': '246' also works
            'Resource': ( # 'resource' also works
                'NCSA DELTA GPU', # '3032' also works
                'STAMPEDE2 TACC', # '2825' also works
            ),
        },
    )
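
The shape of the filters dictionary can be modeled locally to make its meaning concrete. The sketch below does not call XDMoD; the `normalize_filters` and `matches` helpers and the sample records (including the `OTHER RESOURCE` label) are hypothetical. It assumes the semantics described above: values within one dimension are alternatives, and all dimensions must match.

```python
def normalize_filters(filters):
    """Return a copy of filters in which every value is a tuple of strings."""
    normalized = {}
    for dimension, values in filters.items():
        if isinstance(values, str):
            values = (values,)
        normalized[dimension] = tuple(values)
    return normalized


def matches(record, filters):
    """Return True if record satisfies every dimension's filter.

    Values within one dimension are alternatives (OR); different
    dimensions must all match (AND).
    """
    return all(
        record.get(dimension) in allowed
        for dimension, allowed in normalize_filters(filters).items()
    )


demo_filters = {
    'Field of Science': 'Biophysics',
    'Resource': ('NCSA DELTA GPU', 'STAMPEDE2 TACC'),
}
# Made-up records used only to exercise the filter logic.
demo_records = [
    {'Field of Science': 'Biophysics', 'Resource': 'NCSA DELTA GPU'},
    {'Field of Science': 'Biophysics', 'Resource': 'OTHER RESOURCE'},
    {'Field of Science': 'Chemistry', 'Resource': 'STAMPEDE2 TACC'},
]
kept = [r for r in demo_records if matches(r, demo_filters)]
print(kept)
```

Only the first record is kept, because it is the only one that matches both the field of science and one of the listed resources.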

### Dataset Type

The **dataset type** can either be 'timeseries' (the default), in which data are grouped by a time [aggregation unit](#Aggregation-unit), or 'aggregate', in which the data are aggregated across the entire [duration](#Duration). For 'aggregate', the results are returned as a Pandas Series rather than a DataFrame.

The code below shows how to create a bar plot of data aggregated over four months, grouped by resource.

In [None]:
metric_label = 'Number of Users: Active'
with dw:
    df = dw.get_data(
        duration=('2023-01-01', '2023-04-30'),
        realm='Jobs',
        metric=metric_label,
        dimension='Resource',
        dataset_type='aggregate',
    )
plot = px.bar(df, labels={'value': metric_label})
plot.update_layout(
    showlegend=False,
    xaxis_automargin=True,
)
plot.show()
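
Because an aggregate result is a Series, it can be handy to sort it or convert it to a DataFrame before plotting. The sketch below uses a synthetic Series with made-up resource names and values. It assumes, for illustration, that the Series is named after the metric and indexed by the dimension.

```python
import pandas as pd

# Synthetic stand-in for an 'aggregate' result: one value per resource.
demo_series = pd.Series(
    {'RESOURCE A': 120, 'RESOURCE B': 45, 'RESOURCE C': 300},
    name='Number of Users: Active',
)
demo_series.index.name = 'Resource'

# Sort so the largest value comes first, then convert to a DataFrame
# whose single column takes the name of the Series.
demo_frame = demo_series.sort_values(ascending=False).to_frame()
print(type(demo_series).__name__, type(demo_frame).__name__)
print(demo_frame)
```

Sorting before plotting orders the bars from largest to smallest.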

### Aggregation unit

The **aggregation unit** specifies how data are aggregated by time. You can get a list of valid aggregation units by calling the `get_aggregation_units()` method.

In [None]:
with dw:
    display(dw.get_aggregation_units())
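
As a rough local analogy for what a coarser aggregation unit means, the sketch below rolls synthetic daily values up to months with Pandas. This is not how XDMoD computes its results, and it is only meaningful for metrics that can be summed (for example, it would overcount a metric such as a number of distinct users). The values are made up.

```python
import pandas as pd

# Synthetic daily values for a summable metric, spanning a month boundary.
index = pd.date_range('2023-01-30', '2023-02-02', freq='D', name='Time')
daily = pd.DataFrame(
    {'CPU Hours: Total': [100.0, 150.0, 200.0, 75.0]},
    index=index,
)

# 'MS' groups rows by calendar month and labels each group by month start.
monthly = daily.resample('MS').sum()
print(monthly)
```

For real queries, requesting the desired `aggregation_unit` from `get_data()` is preferable to re-aggregating locally.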

## Choropleth example

As another example of the types of visualizations you can make using the Data Analytics Framework, the code cell below gets the number of active users in each US state (based on the location of the users' institutions) and displays a choropleth map.

In [None]:
state_names_to_abbreviations = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'US Virgin Islands': 'VI',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}
metric = 'Number of Users: Active'
with dw:
    df = dw.get_data(
        duration=('2023-01-01', '2023-04-30'),
        realm='Jobs',
        metric=metric,
        dimension='User Institution State',
        filters={
            'User Institution Country': 'United States',
        },
        dataset_type='aggregate',
    ).to_frame()
df['Abbreviation'] = df.index.map(state_names_to_abbreviations)
plot = px.choropleth(
    df,
    locations='Abbreviation',
    color_continuous_scale=px.colors.carto.Temps,
    color=metric,
    locationmode='USA-states',
    scope='usa',
)
plot.update_layout(
    margin={'b': 0.5, 't': 0.5},
)
plot.show()
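
When mapping names to abbreviations as in the cell above, any index value missing from the dictionary becomes `NaN` and will not appear on the map. The sketch below shows one way to check for such rows. It uses a small synthetic DataFrame with made-up values and a made-up `Unknown` label.

```python
import pandas as pd

demo_abbreviations = {'New York': 'NY', 'Texas': 'TX'}
demo_states = pd.Series(
    {'New York': 40, 'Texas': 25, 'Unknown': 3},
    name='Number of Users: Active',
).to_frame()
demo_states['Abbreviation'] = demo_states.index.map(demo_abbreviations)

# Index.map() with a dict yields NaN for keys that are not in the dict.
unmapped = demo_states[demo_states['Abbreviation'].isna()].index.tolist()
mapped = demo_states.dropna(subset=['Abbreviation'])
print(unmapped)
print(mapped)
```

Inspecting `unmapped` before plotting makes it clear which rows, if any, were left off the map.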

## Additional examples

For additional examples, please see the [xdmod-notebooks repository](https://github.com/ubccr/xdmod-notebooks).

In [None]:
# This cell is used to create the footer of this notebook.
from xdmod_data.report import footer
footer({
    'history': [
        ['1', '2023-07-21', 'Initial version'],
        ['2', '2024-09-27', 'Add more example plots and update documentation'],
        ['3', '2025-07-18', 'Update for JupyterHub support, add choropleth example'],
    ],
})