# ICER User Data Analysis Project Demo

Below we demonstrate some of the methods we developed in this project to better understand the ICER user base.

We start by importing the packages used in this analysis:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

## The Data

We begin by looking at the two datasets this analysis is based on. The first is the GPFS metadata, a dataset containing metadata on every file within each user's home directory. Note that the GPFS data file has no header row, so we define the column names ourselves. The names come from a metadata file in the same directory as the data.

In [2]:
# defining GPFS data columns
column_names = ["Inode (file unique ID)",
"KB Allocated",
"File Size",
"Creation Time in days from today",
"Change Time in days from today",
"Modification time in days from today",
"Acces time in days from today",
"GID numeric ID for the group owner of the file",
"UID numeric ID for the owner of the file"]

# loading in data
GPFS = pd.read_csv("/mnt/research/CMSE495-SS24-ICER/file_system_usage/gpfs-stats/inode-size-age-jan-23",
                   header=None,
                   names=column_names,
                   sep=" ",
                   nrows=1000)  # for the demo, we load only the first 1000 rows

# displaying data
GPFS.head()

| | Inode (file unique ID) | KB Allocated | File Size | Creation Time in days from today | Change Time in days from today | Modification time in days from today | Acces time in days from today | GID numeric ID for the group owner of the file | UID numeric ID for the owner of the file |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 100663296 | 0 | 8 | 1447 | 1447 | 3131 | 1447 | 2035 | 762231 |
| 1 | 100663297 | 0 | 188 | 1447 | 1447 | 1937 | 1447 | 2010 | 614955 |
| 2 | 100663301 | 0 | 567 | 1447 | 1447 | 3142 | 1447 | 2035 | 762231 |
| 3 | 100663304 | 0 | 87 | 1447 | 1447 | 3142 | 1447 | 2035 | 762231 |
| 4 | 100663306 | 0 | 1689 | 1447 | 1447 | 1937 | 1447 | 2010 | 614955 |
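
The file path above is specific to the HPCC, so the same loading pattern can be tried on in-memory data instead. The sketch below is illustrative and not part of the project code: it takes three rows copied from the output above, writes them space-separated (as the raw file is assumed to be), and passes them through `pd.read_csv` with `header=None` and `names=`.

```python
import io

import pandas as pd

# Same column names as in the cell above (spelling kept as in the source)
column_names = ["Inode (file unique ID)",
                "KB Allocated",
                "File Size",
                "Creation Time in days from today",
                "Change Time in days from today",
                "Modification time in days from today",
                "Acces time in days from today",
                "GID numeric ID for the group owner of the file",
                "UID numeric ID for the owner of the file"]

# Three rows from the displayed output, in the assumed raw layout
raw_text = (
    "100663296 0 8 1447 1447 3131 1447 2035 762231\n"
    "100663297 0 188 1447 1447 1937 1447 2010 614955\n"
    "100663301 0 567 1447 1447 3142 1447 2035 762231\n"
)

# header=None tells pandas the first line is data, not column labels
gpfs_sample = pd.read_csv(io.StringIO(raw_text),
                          header=None,
                          names=column_names,
                          sep=" ")

print(gpfs_sample.shape)
```

Passing `names` already implies `header=None` in pandas, so stating it explicitly mainly documents the intent.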


Next, we look at the SLURM data, which contains information on every job submitted to the HPCC from the end of September through October.

In [3]:
# loading in data
slurm = pd.read_csv("/mnt/research/CMSE495-SS24-ICER/slurm_usage/DID_FINAL_SLURM_OCT_2023.csv",
                    delimiter="|",
                    nrows=1000)

# dropping unnecessary columns
slurm = slurm.drop(columns=["Unnamed: 0.1","Unnamed: 0"])

# displaying first few rows
slurm.head()

| | JobID | User | Group | Submit | Start | End | Elapsed | State | Account | AssocID | ... | CPUTimeRAW | ReqCPUS | AllocCPUS | ReqMem | MaxRSS | ReqNodes | NNodes | NodeList | ReqTRES | AllocTRES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31496544 | user_679 | group_121 | 2023-03-21T11:13:45 | Unknown | Unknown | 00:00:00 | PENDING | account_017 | assocID_489 | ... | 0 | 28 | 0 | 21000M | | 1 | 1 | None assigned | billing=3192,cpu=28,gres/gpu=4,mem=21000M,node=1 | |
| 1 | 31497932 | user_679 | group_121 | 2023-03-21T11:31:18 | Unknown | Unknown | 00:00:00 | PENDING | account_017 | assocID_489 | ... | 0 | 28 | 0 | 21000M | | 1 | 1 | None assigned | billing=3192,cpu=28,gres/gpu=4,mem=21000M,node=1 | |
| 2 | 31993628 | user_105 | group_114 | 2023-03-22T18:19:12 | Unknown | Unknown | 00:00:00 | PENDING | account_017 | assocID_661 | ... | 0 | 12 | 0 | 150G | | 1 | 1 | None assigned | billing=23347,cpu=12,gres/gpu=8,mem=150G,node=1 | |
| 3 | 39087660 | user_652 | group_054 | 2023-04-04T13:09:10 | Unknown | Unknown | 00:00:00 | PENDING | account_017 | assocID_557 | ... | 0 | 640 | 0 | 20G | | 10 | 10 | None assigned | billing=3112,cpu=640,mem=20G,node=10 | |
| 4 | 59062820 | user_188 | group_046 | 2023-05-08T09:58:20 | 2024-01-01T00:58:57 | 2024-01-01T00:59:06 | 00:00:09 | COMPLETED | account_017 | assocID_676 | ... | 360 | 40 | 40 | 8G | | 1 | 1 | skl-029 | billing=1245,cpu=40,mem=8G,node=1 | billing=1245,cpu=40,mem=8G,node=1 |
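
Several SLURM fields arrive as strings, and jobs that have not started carry the placeholder `Unknown` in `Start` and `End`. As a hedged illustration (this is not taken from the project's code), the sketch below converts a few values copied from the rows above into datetimes and durations. The `ElapsedSeconds` and `QueueWait` column names are hypothetical.

```python
import pandas as pd

# Values copied from the Submit, Start, and Elapsed columns shown above
sample = pd.DataFrame({
    "Submit": ["2023-03-21T11:13:45", "2023-05-08T09:58:20"],
    "Start": ["Unknown", "2024-01-01T00:58:57"],
    "Elapsed": ["00:00:00", "00:00:09"],
})

timestamp_format = "%Y-%m-%dT%H:%M:%S"

# errors="coerce" turns entries that do not match the format
# (such as "Unknown") into NaT instead of raising an error
for col in ["Submit", "Start"]:
    sample[col] = pd.to_datetime(sample[col],
                                 format=timestamp_format,
                                 errors="coerce")

# Elapsed is HH:MM:SS in these rows; convert it to seconds
sample["ElapsedSeconds"] = pd.to_timedelta(sample["Elapsed"]).dt.total_seconds()

# Time spent waiting in the queue; NaT for jobs that never started
sample["QueueWait"] = sample["Start"] - sample["Submit"]

print(sample)
```

SLURM may report elapsed times of a day or more in a `D-HH:MM:SS` form, which this sketch does not handle; it covers only the `HH:MM:SS` values shown here.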


## Identifying users with many files

Below is a function designed for the GPFS dataset. Given a pandas DataFrame of the GPFS data and a threshold, the function returns the (anonymized) ID of each user whose file count is above the threshold, along with the number of files in that user's home directory.

If everything worked correctly, you should see a pandas Series whose index is the user ID and whose values are the file counts for each user above the given threshold.

In [5]:
from ICER_User_Dat.models import UsersWithManyFiles

# defining threshold for file number cutoff
threshold = 30

# running function
too_many_files = UsersWithManyFiles(GPFS, threshold)

# printing results
print(too_many_files)

UID numeric ID for the owner of the file
500120         81
614955        188
638741         67
753559        106
762231        254
785573         50
831677         90
881083         88
1000000092     47
dtype: int64
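
The source of `UsersWithManyFiles` is not shown in this demo. One way to produce output of the same shape, assuming the function counts rows per UID and keeps only counts strictly above the threshold, is a `groupby` followed by a filter. The function name `users_with_many_files_sketch` and the toy data below are hypothetical.

```python
import pandas as pd

UID_COL = "UID numeric ID for the owner of the file"


def users_with_many_files_sketch(gpfs: pd.DataFrame, threshold: int) -> pd.Series:
    """Hypothetical stand-in for UsersWithManyFiles.

    Assumes each row of `gpfs` describes one file, so the number of
    rows per UID equals that user's file count.
    """
    # Count rows (files) per user
    counts = gpfs.groupby(UID_COL).size()
    # Keep users strictly above the threshold (assumed comparison)
    return counts[counts > threshold]


# Toy data: user 1 owns three files, user 2 owns two, user 3 owns one
toy = pd.DataFrame({UID_COL: [1, 1, 1, 2, 2, 3]})
result = users_with_many_files_sketch(toy, threshold=1)
print(result)
```

Whether the real function uses a strict or inclusive comparison cannot be read off the output above, since no user in the sample has exactly 30 files.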


## Identifying users underutilizing resources

In [6]:
from ICER_User_Dat.models import FindUnterutilizerSLURM

## THESE ARE STILL WORK IN PROGRESS ##

## Grouping users

In [7]:
from ICER_User_Dat.models import GroupUsersSLURM

## THESE ARE STILL WORK IN PROGRESS ##

## Predicting walltime

In [8]:
from ICER_User_Dat.models import PredWalltimeSLURM

## THESE ARE STILL WORK IN PROGRESS ##