This notebook analyzes logs of SLURM's sinfo output. The logs are collected over time from multiple login nodes, and the goals are to:
- Parse sinfo logs from .txt files into structured Pandas DataFrames
- Build a time series of node-level states (IDLE, ALLOCATED, DRAIN, etc.)
- Aggregate and clean the data for downstream analysis or visualization
- Remove duplicates and identify rare conflicting entries (e.g., a node with two states at the same time)

In [1]:
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from multiprocessing import Pool
import time
from utils import format_node_names

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.display.precision = 4 # show 4 digits precision

In [2]:
# Set paths
folder_path_slurm_data = Path('/projects/2/prjs1098/system_analytics_2024/slurm_data')

# The empty string '' makes the root slurm_data folder itself one of the searched folders
folder_paths_system_states = [folder_path_slurm_data / path 
                              for path in ['system_states', 'system_states_int4', 'system_states_int5', '']]

# Collect the .txt files of each folder, sorted by name within the folder
files_per_folder = [sorted(path.glob("*.txt"))
                    for path in folder_paths_system_states]
all_files = [file for folder in files_per_folder for file in folder]
all_files

[PosixPath('/projects/2/prjs1098/system_analytics_2024/slurm_data/system_states_90.txt')]

In [3]:
def get_date_for_a_day(file_path):
    """
    Parses a SLURM sinfo log file for one day and returns a DataFrame
    with node state transitions over time.

    Parameters:
        file_path (Path): Path to the .txt file containing one day's worth of sinfo snapshots.

    Returns:
        pd.DataFrame: A dataframe with columns ['node', 'time', 'state'] where each row 
                      corresponds to a node state at a particular timestamp.
    """

    dict_list = []

    # Open and read the full contents of the file
    with file_path.open(mode='r') as f:
        f_text = f.read()

    # Split the file into multiple 'sinfo' samples using separator line
    sinfo_day = f_text.split('############################')

    # Skip the first chunk (the text before the first separator), then iterate through each sinfo snapshot
    for sinfo_sample in sinfo_day[1:]:
        sample_lines = sinfo_sample.split('\n')

        # Index 0 is the remainder of the separator line; index 1 holds the timestamp
        timestamp = sample_lines[1]

        # Data rows start at index 3; the last element (after the final newline) is skipped
        for sample_row in sample_lines[3:-1]:
            # Split the row once and reuse the fields
            fields = sample_row.split()

            # Number of nodes in this node group
            node_count = int(fields[7])

            # State shared by all nodes of the group
            state = fields[8]

            # Node name(s), possibly compressed (e.g., `node[001-004]`)
            node_names = fields[-1]

            # Expand node name ranges using helper function
            node_list = format_node_names(node_names).split(',')
            d1 = {
                'node': node_list,
                'time': [timestamp] * node_count,
                'state': [state] * node_count
            }
            dict_list.append(d1)
            
    # Turn each dictionary into a DataFrame and concatenate them all
    df_list = [pd.DataFrame(d1) for d1 in dict_list]
    df = pd.concat(df_list, axis=0)

    # turn the time into pandas date time
    df['time'] = pd.to_datetime(df['time'], errors='coerce')

    # Sort the result by node and time
    df.sort_values(['node', 'time'], inplace=True)
    return df
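
The helper `format_node_names` is imported from the local `utils` module, and its implementation is not shown in this notebook. As a rough illustration of the kind of expansion it presumably performs, the sketch below defines a hypothetical `expand_hostlist` (not the author's helper) that handles simple expressions with one bracket group per prefix. Real Slurm host lists can be more complex, so treat this as an assumption-laden example only.

```python
import re

def expand_hostlist(expr):
    """Expand a compressed host expression such as 'tcn[1-3,7],gcn2' (hypothetical helper)."""
    hosts = []
    # Each token is a prefix optionally followed by one bracket group, e.g. 'tcn[1-3,7]'
    for token in re.findall(r'[^,\[]+(?:\[[^\]]*\])?', expr):
        match = re.fullmatch(r'([^\[]+)\[([^\]]*)\]', token)
        if match is None:
            hosts.append(token)  # plain host name without brackets
            continue
        prefix, body = match.groups()
        for part in body.split(','):
            if '-' in part:
                start, end = part.split('-')
                width = len(start)  # preserve zero padding, e.g. 001
                hosts.extend(f"{prefix}{i:0{width}d}"
                             for i in range(int(start), int(end) + 1))
            else:
                hosts.append(f"{prefix}{part}")
    return hosts

expanded = expand_hostlist('tcn[1-3,7],gcn2,node[001-002]')
print(expanded)
```

Unlike `format_node_names`, which returns a comma-separated string that the parser then splits, this sketch returns a list directly.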


In [4]:
# Parse the Slurm .txt log files in parallel, one file per task
start_time = time.time()
with Pool(30) as pool:  
    parallel_results = pool.map(get_date_for_a_day, all_files)  # Submit tasks
parallel_duration = time.time() - start_time
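
Since `pool.map` hands out one file per task, a pool larger than the number of files (or the number of available CPUs) brings no extra speedup. A minimal sketch of sizing the pool accordingly is shown below; `pick_worker_count` is a hypothetical helper, and the cap of 30 simply mirrors the value used above.

```python
import os

def pick_worker_count(n_files, max_workers=30):
    """Return a pool size no larger than the file count or CPU count (hypothetical helper)."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(n_files, max_workers, cpu_count))

print(pick_worker_count(1))
```

The result could then be passed as `Pool(pick_worker_count(len(all_files)))`.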

In [5]:
# Display the DataFrame. The Slurm logs provide three columns: node, time, state
df = pd.concat(parallel_results, axis=0)
df.sort_values(['node', 'time'], inplace=True)
display(df.head(), len(df))

Unnamed: 0,node,time,state
0,fcn1,2025-02-07 05:25:48,idle
0,fcn1,2025-02-07 05:25:48,idle
0,fcn1,2025-02-07 05:28:49,idle
0,fcn1,2025-02-07 05:28:49,idle
0,fcn1,2025-02-07 05:31:49,idle


1117032

In [6]:
# check for duplicated rows
df[df.duplicated(['node', 'time', 'state'], keep=False)].sort_values(['node', 'time']).head(10)

total_rows = len(df)
num_duplicates = df.duplicated(['node', 'time', 'state']).sum()

print(f"Total rows: {total_rows}")
print(f"Number of duplicated rows (node, time, state): {num_duplicates}")
print(f"Percentage duplicated: {100 * num_duplicates / total_rows:.2f}%")

Total rows: 1117032
Number of duplicated rows (node, time, state): 362096
Percentage duplicated: 32.42%


In [7]:
# drop duplicates
df.drop_duplicates(subset=['node', 'time', 'state'], inplace=True)
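
To make the duplicate counts above easier to interpret, here is a small sketch on made-up data (the rows are illustrative, not taken from the logs). It shows that `duplicated` with the default `keep='first'` counts only the extra copies, while `keep=False` marks every row that has a twin.

```python
import pandas as pd

# Toy data: the first two rows are exact duplicates on (node, time, state)
toy = pd.DataFrame({
    'node':  ['fcn1', 'fcn1', 'fcn1', 'fcn2'],
    'time':  pd.to_datetime(['2025-02-07 05:25:48'] * 2 + ['2025-02-07 05:28:49'] * 2),
    'state': ['idle', 'idle', 'idle', 'mixed'],
})
key = ['node', 'time', 'state']

n_extra = toy.duplicated(key).sum()                 # copies beyond the first occurrence
n_involved = toy.duplicated(key, keep=False).sum()  # every row that has at least one twin
deduped = toy.drop_duplicates(subset=key)           # keeps the first occurrence of each key

print(n_extra, n_involved, len(deduped))
```

The percentage reported above is therefore the share of rows that `drop_duplicates` removes.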

In [8]:
"""  
Here we check for duplication in node and time. This is extremely rare, and it means that
a node has two different states at the same timestamp. This can happen because we record the states with 1-second precision
from 3 login nodes.
Are there any?
"""
print(f"Number of duplicated rows based on node and time: {df.duplicated(['node', 'time']).sum()}")

# show a sample
display(df[df.duplicated(['node', 'time'], keep=False)].head(n=10))
display(df[df.duplicated(['node', 'time'], keep=False)]['node'].value_counts())

Number of duplicated rows based on node and time: 0


Unnamed: 0,node,time,state


Series([], Name: count, dtype: int64)

In [9]:
# Drop duplicates: we keep the first occurrence and delete the others.
df.drop_duplicates(subset=['node', 'time'], inplace=True)
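
This run contains no conflicting entries, so the following sketch uses made-up rows to show what a conflict looks like and what keeping the first occurrence does. The data and variable names are illustrative assumptions.

```python
import pandas as pd

# Toy data: fcn1 is reported with two different states at the same timestamp
toy = pd.DataFrame({
    'node':  ['fcn1', 'fcn1', 'fcn2'],
    'time':  pd.to_datetime(['2025-02-07 05:25:48'] * 3),
    'state': ['idle', 'allocated', 'idle'],
})

# Count the distinct states per (node, time); more than one means a conflict
states_per_key = toy.groupby(['node', 'time'])['state'].nunique()
conflicts = states_per_key[states_per_key > 1]
print(conflicts)

# Keeping the first occurrence retains whichever state comes first in row order
resolved = toy.drop_duplicates(subset=['node', 'time'], keep='first')
print(resolved)
```

Which state survives depends only on row order, so a different rule (for example, preferring a particular state) would need an explicit sort beforehand.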

In [10]:
""" 
Give a description of the collected data:
how many samples, states, and nodes, the minimum and maximum time, ...
"""
display(df.describe(include='all'), df['state'].value_counts())

Unnamed: 0,node,time,state
count,754936,754936,754936
unique,1547,,14
top,fcn1,,idle
freq,488,,354383
mean,,2025-02-07 17:36:45.024590592,
min,,2025-02-07 05:25:48,
25%,,2025-02-07 11:31:15,
50%,,2025-02-07 17:36:48,
75%,,2025-02-07 23:42:14,
max,,2025-02-08 05:47:38,


state
idle          354383
allocated     259158
mixed         120482
reserved        7232
drained         5893
planned         3316
completing      1597
down*           1184
draining         846
drained*         835
mixed-             4
idle*              3
draining*          2
mixed*             1
Name: count, dtype: int64
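
Several of the 14 states differ only by a trailing character such as `*` or `-`. In sinfo output, a trailing `*` marks a node that is not responding; the sinfo man page lists the meaning of the other suffixes. If downstream analysis should group these variants together, one option is to split each state into a base name and a suffix flag. The sketch below works on a hand-written sample of states, and the column names `base` and `flag` are assumptions.

```python
import pandas as pd

# Hand-written sample of state strings like those counted above
states = pd.Series(['idle', 'down*', 'mixed-', 'drained*', 'allocated'])

# Split into a lowercase base state and any trailing non-letter flag characters
parts = states.str.extract(r'^(?P<base>[a-z_]+)(?P<flag>[^a-z_]*)$')
print(parts)
```

Keeping the flag in its own column preserves the information while allowing counts per base state.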

In [10]:
""" 
Are we measuring the data regularly? No.
Here we compute the time difference between consecutive measurements for each node.
For two consecutive measurements of the same node at times t1 and t2:

t2 - t1 = Delta t


What are the max and min of Delta t?
Is this acceptable, or does it make the analysis difficult?
"""

display(df.groupby(['node'], as_index=False)[['time']].diff().describe(percentiles=[0.25, 0.5, 0.75, 0.9996]))

""" 
The statistics for Delta t show that a large portion of the measurements have a time difference of less than 3 minutes.
But it seems that at some point in our measurement process we did not record states for some nodes for around 2 hours.
This could be due to faults in our measurements, or because specific nodes did not appear in the sinfo output.
Which nodes are those? Are they srv nodes? ==> Unknown
"""

Unnamed: 0,time
count,753389
mean,0 days 00:03:00.102669404
std,0 days 00:00:00.303526803
min,0 days 00:03:00
25%,0 days 00:03:00
50%,0 days 00:03:00
75%,0 days 00:03:00
99.96%,0 days 00:03:01
max,0 days 00:03:01


' \nThe statistics for Delta t show that a large portion of the measurements have a time difference of less than 3 minutes.\nBut it seems that at some point in our measurement process we did not record states for some nodes for around 2 hours.\nThis could be due to faults in our measurements, or because specific nodes did not appear in the sinfo output.\nWhich nodes are those? Are they srv nodes? ==> Unknown\n'

In [11]:
"""  
How many times do these long pauses occur in the measurements?
"""
df_delta_counts = df.groupby(['node'], as_index=False)[['time']].diff().value_counts()
df_delta_counts.sort_index().tail(n=20)

time           
0 days 00:03:00    676039
0 days 00:03:01     77350
Name: count, dtype: int64
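
To find out which nodes are affected by long pauses, the per-node time difference can be kept as a column and filtered against a threshold. The sketch below uses made-up timestamps, and the 10-minute cutoff is an arbitrary assumption rather than a value taken from this analysis.

```python
import pandas as pd

# Toy data: fcn1 has one gap of two hours, fcn2 is sampled every three minutes
toy = pd.DataFrame({
    'node': ['fcn1'] * 4 + ['fcn2'] * 3,
    'time': pd.to_datetime([
        '2025-02-07 05:25:48', '2025-02-07 05:28:48',
        '2025-02-07 05:31:49', '2025-02-07 07:31:49',
        '2025-02-07 05:25:48', '2025-02-07 05:28:48', '2025-02-07 05:31:48',
    ]),
})

toy = toy.sort_values(['node', 'time'])
# Difference to the previous measurement of the same node (NaT for each node's first row)
toy['delta'] = toy.groupby('node')['time'].diff()

threshold = pd.Timedelta(minutes=10)  # assumed cutoff for a "long pause"
gaps = toy[toy['delta'] > threshold]
print(gaps)
```

Each row of `gaps` marks the first measurement after a pause, so the `node` column lists the affected nodes directly.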

In [12]:
# Save the data and go to the EDA notebook for further analysis
time_formatted = pd.Timestamp(time.time(), unit='s').strftime('%Y-%m-%d')
# df.to_parquet(folder_path_slurm_data/f"sinfo_cleaned_{time_formatted}.parquet.gzip", compression='gzip')