# XDMoD Data Analytics Framework — Random Forest Example
University at Buffalo — Center for Computational Research

## Introduction
The `xdmod_data` Python module provides API access to the data in XDMoD. This notebook shows an example of how to use the `get_raw_data()` method to obtain and process individual records. In this example, you will obtain job performance data, which is contained in the `SUPREMM` realm in XDMoD, and use the data to train a machine learning model.

## Configure IPython notebook formatting

### Exceptions
Run the code below to simplify how Python exceptions are displayed in this notebook.

In [1]:
import sys
def exception_handler(exception_type, exception, traceback):
    print("%s: %s" % (exception_type.__name__, exception), file=sys.stderr)
get_ipython()._showtraceback = exception_handler
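
To show what the simplified display looks like, here is a minimal sketch that applies the same format string outside of IPython. The helper name `format_exception` is made up for this illustration and is not part of IPython or `xdmod_data`.

```python
def format_exception(exception):
    """Format an exception as '<type name>: <message>', like the handler above."""
    return "%s: %s" % (type(exception).__name__, exception)


try:
    int('not a number')  # deliberately raise a ValueError
except ValueError as error:
    message = format_exception(error)

print(message)
```

With the handler installed, a failing cell shows only a one-line summary of this form instead of a full traceback.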

### Tables
Run the code below to set up for displaying Pandas DataFrames as Markdown tables in this notebook.

In [2]:
from IPython.display import display, Markdown
def display_df_md_table(df):
    return display(Markdown(df.replace('\n', '<br/>', regex=True).to_markdown()))

### Plots
Run the code below to set up the external Plotly library to make plots using a custom XDMoD theme.

In [3]:
import plotly.express as px
import plotly.io as pio
import xdmod_data.themes
pio.templates.default = "timeseries"

## Create an environment file
The `xdmod-data.env` file will store your XDMoD API token.

Run the code below to create the file in your home directory (if it does not already exist) and allow only you to read and write to it.

In [3]:
from pathlib import Path
from os.path import expanduser
xdmod_data_env_path = Path(expanduser('~/xdmod-data.env'))
try:
    with open(xdmod_data_env_path):
        pass
except FileNotFoundError:
    with open(xdmod_data_env_path, 'w') as xdmod_data_env_file:
        xdmod_data_env_file.write('XDMOD_API_TOKEN=')
    xdmod_data_env_path.chmod(0o600)
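
If you want to confirm that a file is readable and writable only by you, a check along the following lines may help. This is a sketch that assumes a POSIX system; the helper `is_private_to_owner` is hypothetical, and it runs against a throwaway file rather than your real `xdmod-data.env`.

```python
import stat
import tempfile
from pathlib import Path


def is_private_to_owner(path: Path) -> bool:
    """Return True if group and others have no permissions on `path` (POSIX)."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return (mode & 0o077) == 0


# Demonstrate on a temporary file, not the real environment file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'demo.env'
    demo_path.write_text('XDMOD_API_TOKEN=')
    demo_path.chmod(0o600)  # owner read/write only
    private_after_chmod = is_private_to_owner(demo_path)
    demo_path.chmod(0o644)  # group and others can read
    private_when_world_readable = is_private_to_owner(demo_path)

print(private_after_chmod, private_when_world_readable)
```

To check your own file, you could call `is_private_to_owner(xdmod_data_env_path)` after running the cell above.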

## Obtain an API token
Follow [these instructions](https://open.xdmod.org/data-analytics-framework.html#api-token-generation) to obtain an API token.

## Store your API token in the environment file
Open the `xdmod-data.env` file and paste your token after `XDMOD_API_TOKEN=`. Make sure there are no spaces before or after the equals sign.

Save the file.
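
The formatting rule above can be checked with a short sketch like the one below. The helper `token_line_is_well_formed` is hypothetical, and it assumes that a token contains no whitespace; the token values shown are placeholders, not real tokens.

```python
import re

# Matches `XDMOD_API_TOKEN=<token>` with nothing around the equals sign.
# Assumption: tokens never contain whitespace.
_TOKEN_LINE = re.compile(r'^XDMOD_API_TOKEN=(\S+)$')


def token_line_is_well_formed(line: str) -> bool:
    """Return True if the line has a non-empty token and no stray spaces."""
    return _TOKEN_LINE.match(line.rstrip('\n')) is not None


print(token_line_is_well_formed('XDMOD_API_TOKEN=placeholder-token'))    # well formed
print(token_line_is_well_formed('XDMOD_API_TOKEN = placeholder-token'))  # spaces around '='
print(token_line_is_well_formed('XDMOD_API_TOKEN='))                     # token not pasted yet
```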

## Load your XDMoD API token into the environment
Run the code below to load the contents of the `xdmod-data.env` file into the environment. It will print `True` if it successfully loaded the file.

In [2]:
from dotenv import load_dotenv
load_dotenv(xdmod_data_env_path, override=True)

True
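
`True` only indicates that the file was loaded; it does not confirm that a token was pasted into it. As a rough sketch, you could check the environment for a non-empty `XDMOD_API_TOKEN` value. The helper `api_token_is_loaded` is hypothetical and is not part of `xdmod_data` or `python-dotenv`.

```python
import os


def api_token_is_loaded(environ=None) -> bool:
    """Return True if XDMOD_API_TOKEN is present and not blank."""
    if environ is None:
        environ = os.environ
    token = environ.get('XDMOD_API_TOKEN', '')
    return bool(token.strip())


# Checks against plain dictionaries standing in for the environment.
print(api_token_is_loaded({'XDMOD_API_TOKEN': 'placeholder-token'}))
print(api_token_is_loaded({'XDMOD_API_TOKEN': ''}))
```

Calling `api_token_is_loaded()` with no arguments checks the real environment of the running notebook.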

## Initialize the XDMoD Data Warehouse
Run the code below to prepare for getting data from the XDMoD data warehouse at the given URL.

In [4]:
from xdmod_data.warehouse import DataWarehouse
dw = DataWarehouse(xdmod_host='https://xdmod-dev.ccr.xdmod.org')

## Get the raw data

Use the `get_raw_data()` method to query XDMoD and load the resulting raw data into a Pandas DataFrame. For example, [...]. The method's parameters, and the use of `with` to create a runtime context, are explained later in this notebook.

In [7]:
with dw:
    data = dw.get_raw_data(
        duration=('2022-01-01', '2022-02-01'),
        realm='SUPREMM',
        fields=(
            #'Application',
            'CPU User',
            'CPU User cov',
            'Wall Time',
            #'Memory Used',
            'Net Ib0 Rx',
            'Net Ib0 Tx',
            'Memory Used Cov',
            'Net Ib0 Rx Cov',
            'Net Ib0 Tx Cov',
        ),
        filters={
            'Resource': 'STAMPEDE2 TACC',
        },
        show_progress=True
    )

Got 116564 rows...DONE


## Inspect the data

In [8]:
print(type(data))

<class 'pandas.core.frame.DataFrame'>


In [9]:
print(data.dtypes)

Wall Time          string[python]
CPU User           string[python]
CPU User cov       string[python]
Memory Used Cov    string[python]
Net Ib0 Rx         string[python]
Net Ib0 Tx         string[python]
Net Ib0 Rx Cov     string[python]
Net Ib0 Tx Cov     string[python]
dtype: object


In [10]:
display(data)

| | Wall Time | CPU User | CPU User cov | Memory Used Cov | Net Ib0 Rx | Net Ib0 Tx | Net Ib0 Rx Cov | Net Ib0 Tx Cov |
|---|---|---|---|---|---|---|---|---|
| 0 | 1819 | 97.0192650621672 | 0.0922169508114965 | 0.1712162740324448 | | | | |
| 1 | 1406 | 5.443026232222929 | 1.3310222051640888 | 0.33350497233333687 | | | | |
| 2 | 1732 | 6.326001282872144 | 1.503615544615307 | 0.33110851995880675 | | | | |
| 3 | 86414 | 99.48133814417768 | 0.0012969676282528703 | 0.053891950207007544 | | | | |
| 4 | 1019 | 99.34625044231122 | 0.00072267430055728 | 0.5083797744115175 | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 116559 | 59359 | 91.45594013183106 | 0.029889844729945648 | 0.03532388876643469 | | | | |
| 116560 | 5015 | 32.710347492129785 | 0.19673823570113705 | 0.2198490132514679 | | | | |
| 116561 | 32 | 8.511856545151716 | 0.7224990700116423 | 0 | | | | |
| 116562 | 5 | 3.97252363881105 | 1.9452025150708465 | 0 | | | | |


## Prepare data for training
The table above shows that some records have `<NA>` values. This is not unusual and typically means that the data were not collected (for example, because of insufficient data points or errors in data collection).

To prepare for training, filter out records that have `<NA>` data:

In [None]:
data = data.dropna()
display(data)
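
Before dropping rows, it can be useful to see how many values are missing in each column, since `dropna()` removes every row that has at least one missing value. The sketch below uses a small, made-up DataFrame named `sample` as a stand-in; its values are illustrative and are not XDMoD output.

```python
import pandas as pd

# Made-up stand-in data using the same string dtype as the raw data.
sample = pd.DataFrame({
    'Wall Time': ['1819', '1406', '32'],
    'CPU User': ['97.0', None, '8.5'],
    'Net Ib0 Rx': [None, None, '1000.0'],
}).astype('string')

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Drop rows with a missing value in any column.
complete_rows = sample.dropna()

# Alternatively, only require values in the columns you plan to use.
rows_with_cpu_data = sample.dropna(subset=['Wall Time', 'CPU User'])

print(len(sample), len(complete_rows), len(rows_with_cpu_data))
```

If one column is missing for most records, restricting the check with `subset` (or leaving that column out of `fields`) may keep more rows for training than a plain `dropna()`.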

The `get_raw_data()` method returns a DataFrame containing string data. To make the data easier to manipulate, plot, and use for training, convert every column to floating point, then convert `Wall Time` to integer:

In [13]:
data = data.astype(float)
data['Wall Time'] = data['Wall Time'].astype(int)
print(data.dtypes)
display(data)

Wall Time            int64
CPU User           float64
CPU User cov       float64
Memory Used Cov    float64
Net Ib0 Rx         float64
Net Ib0 Tx         float64
Net Ib0 Rx Cov     float64
Net Ib0 Tx Cov     float64
dtype: object


| | Wall Time | CPU User | CPU User cov | Memory Used Cov | Net Ib0 Rx | Net Ib0 Tx | Net Ib0 Rx Cov | Net Ib0 Tx Cov |
|---|---|---|---|---|---|---|---|---|
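
The same two-step conversion can be illustrated on a single record without Pandas. This is only a sketch: the helper `convert_record` is hypothetical, and the record reuses two values from the first row displayed earlier.

```python
def convert_record(record):
    """Convert all string values to float, then make 'Wall Time' an int."""
    converted = {name: float(value) for name, value in record.items()}
    converted['Wall Time'] = int(converted['Wall Time'])
    return converted


raw_record = {'Wall Time': '1819', 'CPU User': '97.0192650621672'}
typed_record = convert_record(raw_record)
print(typed_record)
```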
