# XDMoD Data Analytics Framework — Raw Data Examples

Document version 2 (2025-07-18)

Compatible with XDMoD Data Analytics Framework v≥1.0.0 and <2.0.0

© 2023–2025 University at Buffalo Center for Computational Research

See the [xdmod-notebooks](https://github.com/ubccr/xdmod-notebooks) repository for information on setup, support, contributing, licensing, and referencing.

## Introduction
The XDMoD Data Analytics Framework provides API access to the data in an XDMoD portal via the [xdmod_data](https://pypi.org/project/xdmod-data) Python package. This notebook provides examples showing how to use the `get_raw_data()` method to obtain and process individual records from XDMoD. In this example, you will obtain low-level job performance data from the `SUPREMM` realm in XDMoD.

The XDMoD Data Analytics Framework can be run either in an XDMoD-hosted JupyterHub (e.g., by clicking the "JupyterLab" button in ACCESS XDMoD) or locally on your machine.

## Install/upgrade the required modules
Run the code below to install/upgrade the packages needed to run this notebook.

In [None]:
import sys
! {sys.executable} -m pip install --upgrade 'xdmod-data[report]>=1.0.0,<2.0.0' python-dotenv tabulate

If running that code caused a new version of Plotly to be installed/upgraded, you may need to refresh your browser window for plots to appear correctly.

## Configure notebook formatting

### Exceptions
Run the code below to simplify how Python exceptions are displayed in this notebook.

In [None]:
import sys
def exception_handler(exception_type, exception, traceback):
    print("%s: %s" % (exception_type.__name__, exception), file=sys.stderr)
get_ipython()._showtraceback = exception_handler

### Tables
Run the code below to define a helper function that displays Pandas DataFrames as Markdown tables in this notebook.

In [None]:
from IPython.display import display, Markdown
def display_df_md_table(df):
    return display(Markdown(df.replace('\n', '<br/>', regex=True).to_markdown()))

### Plots
Run the code below to set up the external Plotly package to make plots using a custom XDMoD theme.

In [None]:
import plotly.express as px
import plotly.io as pio
import xdmod_data.themes
pio.templates.default = 'timeseries'

## Prepare to authenticate with XDMoD

If you are running this notebook in an XDMoD-hosted JupyterHub (e.g., you clicked the "JupyterLab" button in ACCESS XDMoD), then authentication happens automatically and you can skip this section.

Otherwise, if you are running this notebook in a different Jupyter environment, you will need to obtain an API token from the XDMoD portal following [these instructions](https://github.com/ubccr/xdmod-data#api-token-access) and save it to a file that can be accessed by the Jupyter environment (e.g., in the home directory at `~/xdmod-data.env`) with the contents `XDMOD_API_TOKEN=token`, replacing `token` with your token. This file should be saved with `600` permissions (user read/write only). After you have done this, if you uncomment the last line of the code cell below and run it, it will read your token from `~/xdmod-data.env` into the environment, which will be used later when you start running methods from the API. It will print `True` if it successfully loaded the file.

In [None]:
from dotenv import load_dotenv
from os.path import expanduser
from pathlib import Path
#load_dotenv(Path(expanduser('~/xdmod-data.env')), override=True)
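
The recommendation above to save the token file with `600` permissions can be checked from Python. The cell below is a minimal sketch using only the standard library; the helper name `is_owner_only` is hypothetical, and a temporary file stands in for `~/xdmod-data.env` so that nothing in your home directory is touched. It assumes a POSIX file system.

```python
import os
import stat
import tempfile


def is_owner_only(path):
    """Return True if the file mode is exactly 600 (owner read/write only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600


# Demonstrate on a temporary file standing in for ~/xdmod-data.env.
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = os.path.join(tmp_dir, 'xdmod-data.env')
    with open(env_path, 'w') as env_file:
        env_file.write('XDMOD_API_TOKEN=token\n')
    os.chmod(env_path, 0o644)
    before = is_owner_only(env_path)  # Group and others can still read the file
    os.chmod(env_path, 0o600)
    after = is_owner_only(env_path)  # Only the owner can read or write the file
```

To check your real token file, you could call `is_owner_only(os.path.expanduser('~/xdmod-data.env'))` after creating it.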

## Initialize the DataWarehouse object

Run the code below to initialize a `DataWarehouse` object that will be used for making the API calls.

If you are running in an XDMoD-hosted JupyterHub, this object will make requests to the same XDMoD portal that is hosting the JupyterHub. To make requests to a different portal instead, you can specify the URL of that portal as a string parameter to the `DataWarehouse` constructor.

Otherwise, if you are running in a different Jupyter environment, you will need to specify the URL of the XDMoD portal as a string parameter to the `DataWarehouse` constructor (or set the `XDMOD_HOST` environment variable).

In [None]:
from xdmod_data.warehouse import DataWarehouse
dw = DataWarehouse()

## Get the raw data

Run the code below to use the `get_raw_data()` method to request raw data from XDMoD and load them into a Pandas DataFrame. This example gets three days' worth of low-level performance data of jobs run on ACCESS-allocated resources. Each of the parameters of the method will be explained later in this notebook. Use `with` to create a runtime context; this is also explained later in this notebook.

In [None]:
with dw:
    df = dw.get_raw_data(
        duration=('2023-05-01', '2023-05-03'),
        realm='SUPREMM',
        show_progress=True,
    )

Note that even just three days' worth of raw data amounts to over 100,000 rows. In contrast, the `get_data()` method returns data aggregated over a time period (day, week, month, etc.).

Inspect the data:

In [None]:
display(df)

Each row has many columns of data. View the names of all the columns:

In [None]:
display(df.columns)

Choose which columns to analyze. For example, compare wall time and total memory used in a scatter plot:

In [None]:
plot = px.scatter(
    df,
    x='Wall Time',
    y='Total memory used',
    title='Total memory used vs. wall time of ACCESS jobs, 05/01/2023–05/03/2023',
)
plot.update_layout(hovermode=False) # Prevent hover interactions causing lag due to so many points
plot.show()

Wall time is measured in seconds, and total memory used is measured in bytes. Convert these to hours and gigabytes, respectively:

In [None]:
df['Wall Time (hours)'] = df['Wall Time'].astype(float) / 3600
df['Total memory used (GB)'] = df['Total memory used'].astype(float) / 1e9
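
The divisors used above can be written out as named constants to make the units explicit. The cell below is a small sketch in plain Python; the helper names are hypothetical and are not part of `xdmod_data`. It also shows the binary gibibyte (2^30 bytes) for comparison, since this notebook uses the decimal gigabyte (10^9 bytes).

```python
SECONDS_PER_HOUR = 3600
BYTES_PER_GB = 1e9  # Decimal gigabyte, as used in this notebook
BYTES_PER_GIB = 2 ** 30  # Binary gibibyte, shown only for comparison


def seconds_to_hours(seconds):
    """Convert a duration in seconds (number or numeric string) to hours."""
    return float(seconds) / SECONDS_PER_HOUR


def bytes_to_gb(num_bytes):
    """Convert a size in bytes (number or numeric string) to decimal gigabytes."""
    return float(num_bytes) / BYTES_PER_GB


def bytes_to_gib(num_bytes):
    """Convert a size in bytes (number or numeric string) to binary gibibytes."""
    return float(num_bytes) / BYTES_PER_GIB


two_hours = seconds_to_hours('7200')
five_gb = bytes_to_gb(5e9)
two_gib = bytes_to_gib(2 ** 31)
```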

Plot the data again:

In [None]:
plot = px.scatter(
    df,
    x='Wall Time (hours)',
    y='Total memory used (GB)',
    title='Total memory used vs. wall time of ACCESS jobs, 05/01/2023–05/03/2023',
)
plot.update_layout(hovermode=False) # Prevent hover interactions causing lag due to so many points
plot.show()

Looking at the graph, many jobs ran for under 48 hours, while some ran for longer. Many jobs used less than 500 GB, while some used more. It is important to note that these data come from multiple different computing resources, each of which has its own architecture and scheduling policies. Color-code the graph by the resource:

In [None]:
plot = px.scatter(
    df,
    x='Wall Time (hours)',
    y='Total memory used (GB)',
    title='Total memory used vs. wall time of ACCESS jobs by resource, 05/01/2023–05/03/2023',
    color='Resource',
)
plot.show()

This graph begins to show that the different resources are used differently. Filter by a specific resource, e.g., PSC Bridges-2 Regular Memory:

In [None]:
df = df[df['Resource'] == 'Bridges 2 RM']
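
If you want to keep rows for more than one resource, the same kind of boolean filtering can be done with the Pandas `isin()` method. The cell below is a sketch on a small made-up DataFrame (the resource names and values are placeholders, not real XDMoD data); it assumes, as in the cells above, that numeric columns may arrive as strings and need converting.

```python
import pandas as pd

# Made-up rows standing in for raw job records.
jobs = pd.DataFrame({
    'Resource': ['Resource A', 'Resource B', 'Resource C', 'Resource A'],
    'Wall Time': ['3600', '7200', '60', '1800'],
})

# Keep only the rows whose resource is in the given list.
# .copy() makes the result independent, so adding a column is safe.
selected = jobs[jobs['Resource'].isin(['Resource A', 'Resource C'])].copy()
selected['Wall Time (hours)'] = selected['Wall Time'].astype(float) / 3600
```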

In [None]:
plot = px.scatter(
    df,
    x='Wall Time (hours)',
    y='Total memory used (GB)',
    title='Total memory used vs. wall time of jobs on Bridges-2 RM, 05/01/2023–05/03/2023',
)
plot.show()

A better approach, if you know you only need to analyze the data from specific resource(s), is to modify the original call to `get_raw_data()` to include a `filters` parameter (this parameter will be explained in detail later in this notebook):

In [None]:
with dw:
    df = dw.get_raw_data(
        duration=('2023-05-01', '2023-05-03'),
        realm='SUPREMM',
        filters={
            'Resource': 'Bridges 2 RM',
        },
        show_progress=True,
    )

This requests fewer rows, taking less time to transfer and less memory.

In [None]:
df['Wall Time (hours)'] = df['Wall Time'].astype(float) / 3600
df['Total memory used (GB)'] = df['Total memory used'].astype(float) / 1e9

With the data from the specific resource, you can further drill down by field of science:

In [None]:
plot = px.scatter(
    df,
    x='Wall Time (hours)',
    y='Total memory used (GB)',
    title='Total memory used vs. wall time of jobs on Bridges-2 RM by field of science, 05/01/2023–05/03/2023',
    color='Field of Science',
)
plot.update_layout(height=550) # Make sure the plot can accommodate a larger legend
plot.show()

If you want to analyze the data for a specific field of science, you can add it to the `filters` dictionary in the original call to `get_raw_data()`. If you do not need to drill down any further, you can also restrict the requested fields of data to only those you need (e.g., wall time and total memory used) by using the `fields` parameter:

In [None]:
with dw:
    df = dw.get_raw_data(
        duration=('2023-05-01', '2023-05-03'),
        realm='SUPREMM',
        fields=(
            'Wall Time',
            'Total memory used',
        ),
        filters={
            'Resource': 'Bridges 2 RM',
            'Field of Science': 'Chemical Engineering',
        },
        show_progress=True,
    )

This greatly reduces the amount of data that needs to be requested, taking up less time and memory.

In [None]:
df['Wall Time (hours)'] = df['Wall Time'].astype(float) / 3600
df['Total memory used (GB)'] = df['Total memory used'].astype(float) / 1e9

In [None]:
plot = px.scatter(
    df,
    x='Wall Time (hours)',
    y='Total memory used (GB)',
    title='Total memory used vs. wall time of chemical engineering jobs on Bridges-2 RM, 05/01/2023–05/03/2023',
)
plot.show()

## Details of the `get_raw_data()` method
Now that you have seen examples of using the `get_raw_data()` method, read below for more details on how it works.

### Wrap data warehouse calls in a runtime context
XDMoD data are accessed over a network connection, which involves establishing connections and creating temporary resources. To ensure these connections and resources are cleaned up properly even if a runtime error occurs, you should call data warehouse methods within a **runtime context** by using Python's `with` statement. Wrap only the execution of XDMoD queries in the `with` block and store the results; execute any long-running calculations outside of the runtime context, as in the template below.

In [None]:
with dw:
    # XDMoD queries would go here
    pass
# Data processing would go here
pass
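
To see why a runtime context helps with cleanup, the cell below is a toy sketch of Python's context manager protocol. The `DemoConnection` class is a hypothetical stand-in and does not show how `DataWarehouse` is implemented; it only illustrates that the `__exit__` method runs even when the code inside the `with` block raises an exception.

```python
class DemoConnection:
    """Hypothetical stand-in for an object that holds a network connection."""

    def __init__(self):
        self.is_open = False

    def __enter__(self):
        # Acquire the resource when the with block is entered.
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        # Release the resource whether or not an exception occurred.
        self.is_open = False
        return False  # Do not suppress the exception


conn = DemoConnection()
try:
    with conn:
        raise RuntimeError('simulated query failure')
except RuntimeError as error:
    caught = str(error)
```

After this cell runs, `conn.is_open` is `False` even though the block raised an error.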

### Parameters
The `get_raw_data()` method has a number of parameters explained in detail below.

### Duration
The **duration** provides the time constraints of the data to be fetched from the XDMoD data warehouse.

As already seen, you can specify the duration as start and end dates. These are both inclusive, so if you only want one day of data, use the same date for start and end:

In [None]:
with dw:
    df = dw.get_raw_data(
        duration=('2023-05-01', '2023-05-01'),
        realm='SUPREMM'
    )
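
Because both ends of the duration are inclusive, a period of N days ends N - 1 days after its start date. The cell below is a sketch of a helper that builds such a tuple with the standard library; the name `inclusive_duration` is hypothetical and is not part of `xdmod_data`.

```python
from datetime import date, timedelta


def inclusive_duration(start, num_days):
    """Return (start, end) ISO date strings spanning num_days, both inclusive."""
    if num_days < 1:
        raise ValueError('num_days must be at least 1')
    first = date.fromisoformat(start)
    last = first + timedelta(days=num_days - 1)
    return (first.isoformat(), last.isoformat())


three_days = inclusive_duration('2023-05-01', 3)
one_day = inclusive_duration('2023-05-01', 1)
```

The resulting tuples have the same form as the `duration` arguments used earlier in this notebook.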

You can instead specify the duration using a special string value; a list of the valid values can be obtained by calling the `get_durations()` method.

In [None]:
with dw:
    durations = dw.get_durations()
display(durations)

### Realm
A **realm** is a category of data in the XDMoD data warehouse. You can use the `describe_raw_realms()` method to get a DataFrame containing the list of realms for which raw data are available.

In [None]:
with dw:
    raw_realms = dw.describe_raw_realms()
display_df_md_table(raw_realms)

### Fields
A **field** is a measurement for which raw data exist in a given realm. You can use the `describe_raw_fields(realm)` method to get a DataFrame containing the list of valid fields in the given realm. The realm must be passed in as a string.

In [None]:
with dw:
    raw_fields = dw.describe_raw_fields('SUPREMM')
display_df_md_table(raw_fields)

### Filters
**Filters** allow you to include only data that have certain values for given **dimensions**, which are groupings of data. You can use the `describe_dimensions(realm)` method to get a DataFrame containing the list of valid dimensions in the given realm. The realm must be passed in as a string and can be either the ID or the label of the realm.

In [None]:
with dw:
    dimensions = dw.describe_dimensions('SUPREMM')
display_df_md_table(dimensions)

You can use the `get_filter_values(realm, dimension)` method to get a DataFrame containing the list of valid filter values for the given dimension in the given realm. The realm and dimension must be passed in as strings.

In [None]:
with dw:
    filter_values = dw.get_filter_values('SUPREMM', 'Resource') # 'resource' also works
display_df_md_table(filter_values)

For methods in the API that take filters as arguments, you must specify the filters as a dictionary in which the keys are dimensions (labels or IDs) and the values are string filter values (labels or IDs) or sequences of string filter values. For example, to return only data for which the field of science is Materials Engineering and the resource is either Bridges-2 RM or TACC Stampede2:

In [None]:
with dw:
    df = dw.get_raw_data(
        duration=('2021-05-01', '2021-05-01'),
        realm='SUPREMM',
        filters={
            'Field of Science': 'Materials Engineering', # 'fieldofscience': '177' also works
            'Resource': ( # 'resource' also works
                'Bridges 2 RM', # '2900' also works
                'STAMPEDE2 TACC', # '2825' also works
            ),
        },
        show_progress=True
    )
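
The shape of the filters dictionary described above (string keys, with values that are either a single string or a sequence of strings) can be checked before making a request. The cell below is a sketch of such a check; `normalize_filters` is a hypothetical helper, not part of `xdmod_data`, and it does not verify that the dimensions or values actually exist in the portal.

```python
def normalize_filters(filters):
    """Return a copy of filters in which every value is a list of strings."""
    normalized = {}
    for dimension, values in filters.items():
        if not isinstance(dimension, str):
            raise TypeError('dimension must be a string: %r' % (dimension,))
        if isinstance(values, str):
            values = [values]  # A single filter value becomes a one-item list
        else:
            try:
                values = list(values)
            except TypeError:
                raise TypeError(
                    'filter values must be a string or a sequence of strings'
                ) from None
        for value in values:
            if not isinstance(value, str):
                raise TypeError('filter value must be a string: %r' % (value,))
        normalized[dimension] = values
    return normalized


example = normalize_filters({
    'Field of Science': 'Materials Engineering',
    'Resource': ('Bridges 2 RM', 'STAMPEDE2 TACC'),
})
```

Passing a non-string value, such as the integer `2900` instead of the string `'2900'`, raises a `TypeError` in this sketch.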

### Show progress
Set the `show_progress` parameter to `True` to periodically print how many rows have been retrieved so far.

## Additional Examples
For additional examples, please see the [xdmod-notebooks repository](https://github.com/ubccr/xdmod-notebooks).

In [None]:
# This cell is used to create the footer of this notebook.
from xdmod_data.report import footer
footer({
    'history': [
        ['1', '2023-07-21', 'Initial version'],
        ['2', '2025-07-18', 'Update for JupyterHub support'],
    ],
})