# Demo Notebook for ICER_Data_Package
This notebook tests the methods of the DataAnalyzer class on two kinds of ICER data files: GPFS and Slurm.

In [1]:
# Import class
from ICER_Data_Package.DataAnalyzer import DataAnalyzer

## Step 1: Prepare Your Data Files
1. GPFS data file: a text file of space-separated values with the columns described in the read_gpfs method, such as "Inode (file unique ID)", "KB Allocated", and "File Size".
2. Slurm data file: a text file of pipe-separated (|) values, following the specification in the read_slurm method.
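
The two file layouts described above can be illustrated with plain pandas. This is only a sketch on tiny made-up inputs, not the package's read_gpfs or read_slurm implementation; the three-column GPFS layout, the absence of a header row in the GPFS file, and the sample values are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature inputs; real ICER files are far larger and have more columns.
gpfs_text = "100663296 0 8\n100663297 0 188\n100663301 0 567\n"
slurm_text = (
    "JobID|User|State\n"
    "31496544|user_679|PENDING\n"
    "59062820|user_188|COMPLETED\n"
)

# Space-separated file, assumed here to have no header row,
# so the column names are supplied explicitly.
gpfs_df = pd.read_csv(
    io.StringIO(gpfs_text),
    sep=" ",
    header=None,
    names=["Inode (file unique ID)", "KB Allocated", "File Size"],
    nrows=2,  # read only the first two rows, similar in spirit to gpfs_nrows
)

# Pipe-separated file whose first line holds the column names.
slurm_df = pd.read_csv(io.StringIO(slurm_text), sep="|")

print(gpfs_df)
print(slurm_df)
```

Limiting rows with nrows keeps exploratory runs fast on very large files, which is presumably why the package exposes row-count parameters.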

## Step 2: Initialize the DataAnalyzer Class

You'll need to provide paths to your GPFS and Slurm data files when initializing the DataAnalyzer class. If you have only one of the two files, pass an empty string for the path you don't have.

In [25]:
# Set the GPFS path, the Slurm path, and the number of rows to read from each file
analyzer = DataAnalyzer(
    gpfspath="/mnt/research/CMSE495-SS24-ICER/file_system_usage/gpfs-stats/inode-size-age-jan-23",
    slurmpath="/mnt/research/CMSE495-SS24-ICER/slurm_usage/DID_FINAL_SLURM_OCT_2023.csv",
    gpfs_nrows=1000, slurm_nrows=10000)


gpfs path= /mnt/research/CMSE495-SS24-ICER/file_system_usage/gpfs-stats/inode-size-age-jan-23
slurm path= /mnt/research/CMSE495-SS24-ICER/slurm_usage/DID_FINAL_SLURM_OCT_2023.csv


In [22]:
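# Access each loaded dataset as a DataFrame; the default number of rows is 1,000.
# Only the last expression in a cell is displayed, so the output below is the Slurm table.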
analyzer.gpfs_data.head()
analyzer.slurm_data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,JobID,User,Group,Submit,Start,End,Elapsed,State,...,CPUTimeRAW,ReqCPUS,AllocCPUS,ReqMem,MaxRSS,ReqNodes,NNodes,NodeList,ReqTRES,AllocTRES
0,0,0,31496544,user_679,group_121,2023-03-21T11:13:45,Unknown,Unknown,00:00:00,PENDING,...,0,28,0,21000M,,1,1,None assigned,"billing=3192,cpu=28,gres/gpu=4,mem=21000M,node=1",
1,1,1,31497932,user_679,group_121,2023-03-21T11:31:18,Unknown,Unknown,00:00:00,PENDING,...,0,28,0,21000M,,1,1,None assigned,"billing=3192,cpu=28,gres/gpu=4,mem=21000M,node=1",
2,2,2,31993628,user_105,group_114,2023-03-22T18:19:12,Unknown,Unknown,00:00:00,PENDING,...,0,12,0,150G,,1,1,None assigned,"billing=23347,cpu=12,gres/gpu=8,mem=150G,node=1",
3,3,3,39087660,user_652,group_054,2023-04-04T13:09:10,Unknown,Unknown,00:00:00,PENDING,...,0,640,0,20G,,10,10,None assigned,"billing=3112,cpu=640,mem=20G,node=10",
4,4,4,59062820,user_188,group_046,2023-05-08T09:58:20,2024-01-01T00:58:57,2024-01-01T00:59:06,00:00:09,COMPLETED,...,360,40,40,8G,,1,1,skl-029,"billing=1245,cpu=40,mem=8G,node=1","billing=1245,cpu=40,mem=8G,node=1"


In [18]:
gpfspath = "/mnt/research/CMSE495-SS24-ICER/file_system_usage/gpfs-stats/inode-size-age-jan-23"

# Use the package's read_gpfs method to read the data with a chosen number of rows
gpfs_20krows = analyzer.read_gpfs(gpfspath, nrows=20000)
gpfs_20krows

Unnamed: 0,Inode (file unique ID),KB Allocated,File Size,Creation Time in days from today,Change Time in days from today,Modification time in days from today,Acces time in days from today,GID numeric ID for the group owner of the file,UID numeric ID for the owner of the file
0,100663296,0,8,1447,1447,3131,1447,2035,762231
1,100663297,0,188,1447,1447,1937,1447,2010,614955
2,100663301,0,567,1447,1447,3142,1447,2035,762231
3,100663304,0,87,1447,1447,3142,1447,2035,762231
4,100663306,0,1689,1447,1447,1937,1447,2010,614955
...,...,...,...,...,...,...,...,...,...
19995,100689153,64,7584,1447,1447,3106,1447,2392,831677
19996,100689154,0,1780,1447,1447,2471,1447,2069,1000000092
19997,100689155,0,0,1447,1447,3131,1487,2035,762231
19998,100689156,64,13677,1447,1447,1820,1447,2019,638741


## Step 3: Use the DataAnalyzer Methods

With your data loaded into the DataAnalyzer instance, you can now use its methods to analyze the data.


1. To identify users with many files in the GPFS data:

  Set file_limit to whatever file count you're interested in (the example uses 100). UsersWithManyFiles returns a DataFrame, stored here as users_with_many_files, listing the users who own more files than that limit. This can help identify heavy users or potential optimizations.

In [26]:
# Example usage on the default 1,000 rows of the GPFS DataFrame
users_with_many_files = analyzer.UsersWithManyFiles(file_limit=100)

In [27]:
users_with_many_files.head()

Unnamed: 0,UID numeric ID for the owner of the file,Number of files
2,762231,254
0,614955,188
1,753559,106
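
For readers curious how such a per-user count could be produced, here is a minimal pandas sketch. It is not the package's UsersWithManyFiles implementation; the helper name users_over_limit and the toy data are hypothetical, and only the column names are taken from the output above.

```python
import pandas as pd

UID_COL = "UID numeric ID for the owner of the file"


def users_over_limit(gpfs_df, file_limit):
    """Return one row per UID owning more than file_limit files, largest first."""
    # Each row of the GPFS table is one file, so the group size is the file count.
    counts = gpfs_df.groupby(UID_COL).size().reset_index(name="Number of files")
    over = counts[counts["Number of files"] > file_limit]
    return over.sort_values("Number of files", ascending=False)


# Toy data (hypothetical): three files for one user, two for another, one for a third.
toy = pd.DataFrame({UID_COL: [762231] * 3 + [614955] * 2 + [753559]})
result = users_over_limit(toy, file_limit=1)
print(result)
```

Because the rows are sorted after the counts are built, the index labels in the result are out of order, much like the 2, 0, 1 index seen in the output above.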


3. To plot the cumulative distribution function (CDF) of files per user:

In [6]:
# Example usage on only 1,000 rows of streamed data
analyzer = DataAnalyzer(gpfspath = "/mnt/research/CMSE495-SS24-ICER/file_system_usage/gpfs-stats/inode-size-age-jan-23")

# full_dataset=False uses only the loaded subset; full_dataset=True uses the full GPFS data.
# The output PNG is saved in the same directory.
analyzer.files_per_user_gpfs(full_dataset=False)

gpfs path= /mnt/research/CMSE495-SS24-ICER/file_system_usage/gpfs-stats/inode-size-age-jan-23
slurm path= 
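
As a rough illustration of what a files-per-user CDF contains, the sketch below computes an empirical CDF with numpy. It is an assumption about the kind of calculation involved, not the code behind files_per_user_gpfs; the helper name files_per_user_cdf and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

UID_COL = "UID numeric ID for the owner of the file"


def files_per_user_cdf(gpfs_df):
    """Return sorted per-user file counts and the fraction of users at or below each."""
    counts = np.sort(gpfs_df.groupby(UID_COL).size().to_numpy())
    # The i-th smallest count (1-based) covers i / n of all users.
    cdf = np.arange(1, len(counts) + 1) / len(counts)
    return counts, cdf


# Toy data (hypothetical): four users owning 3, 2, 1, and 1 files.
toy = pd.DataFrame({UID_COL: [1, 1, 1, 2, 2, 3, 4]})
counts, cdf = files_per_user_cdf(toy)
print(counts, cdf)
```

To draw the curve, the two arrays can be passed to a step plot, for example matplotlib's plt.step(counts, cdf, where="post"), and saved with plt.savefig.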


4. To group users based on their HPCC utilization patterns: